Thread: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Thu, Jan 28, 2010 at 12:27 AM, Heikki Linnakangas
<heikki@postgresql.org> wrote:
> Log Message:
> -----------
> Make standby server continuously retry restoring the next WAL segment with
> restore_command, if the connection to the primary server is lost. This
> ensures that the standby can recover automatically, if the connection is
> lost for a long time and standby falls behind so much that the required
> WAL segments have been archived and deleted in the master.
>
> This also makes standby_mode useful without streaming replication; the
> server will keep retrying restore_command every few seconds until the
> trigger file is found. That's the same basic functionality pg_standby
> offers, but without the bells and whistles.

http://archives.postgresql.org/pgsql-hackers/2010-01/msg01520.php
http://archives.postgresql.org/pgsql-hackers/2010-01/msg02589.php

As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
And this happened in my box when I was testing the SR.

sby [20088] FATAL: archive file "000000010000000000000087" has
wrong size: 14139392 instead of 16777216
sby [20076] LOG: startup process (PID 20088) exited with exit code 1
sby [20076] LOG: terminating any other active server processes
act [18164] LOG: received immediate shutdown request

If the startup process is in standby mode, I think that it should retry
starting replication instead of emitting an error when it finds a
partially-filled file in the archive. Then if the replication has been
terminated, it has only to restore the archived file again. Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> As I pointed out previously, the standby might restore a partially-filled
> WAL file that is being archived by the primary, and cause a FATAL error.
> And this happened in my box when I was testing the SR.
>
> sby [20088] FATAL: archive file "000000010000000000000087" has
> wrong size: 14139392 instead of 16777216
> sby [20076] LOG: startup process (PID 20088) exited with exit code 1
> sby [20076] LOG: terminating any other active server processes
> act [18164] LOG: received immediate shutdown request
>
> If the startup process is in standby mode, I think that it should retry
> starting replication instead of emitting an error when it finds a
> partially-filled file in the archive. Then if the replication has been
> terminated, it has only to restore the archived file again. Thought?

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that is genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
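[Editorial note: the size check discussed here can be sketched as a restore_command wrapper. The script below is a hypothetical illustration, not from the thread — the function name, the 16 MB default segment size, and the argument layout are assumptions. It exits non-zero for a short file so the standby treats it like "not archived yet" and retries, while remaining unable to tell a genuinely truncated segment from one that is mid-copy, which is exactly the limitation Heikki concedes.]

```shell
# restore_wal: hypothetical restore_command helper (sh function form).
# Restores segment $2 from archive directory $1 into target path $3,
# but returns non-zero if the archived file is missing or shorter than
# a full WAL segment, so the standby's retry loop simply tries again.
restore_wal() {
    archive_dir="$1"; wal_file="$2"; target="$3"
    wal_segment_size=16777216   # default 16 MB WAL segment

    [ -f "$archive_dir/$wal_file" ] || return 1

    # A short file is most likely still being copied into the archive;
    # report failure and let the standby retry on its next iteration.
    size=$(wc -c < "$archive_dir/$wal_file")
    [ "$size" -eq "$wal_segment_size" ] || return 1

    cp "$archive_dir/$wal_file" "$target"
}
```

In recovery.conf this could be wired up as, say, `restore_command = 'restore_wal.sh /mnt/archive %f %p'` (path hypothetical). Note the caveat from the message above: the check converts a FATAL error on a genuinely truncated segment into an endless retry, and no script placement can distinguish the two cases.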
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Wed, Feb 10, 2010 at 4:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?

Yes, but only in the standby mode case. OTOH I think that normal archive
recovery should still treat it as a FATAL error.

> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that is genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

Right. But does the server in standby mode really need to complain about
that? We might be able to read such a seemingly truncated WAL file
completely from the primary via SR, or from the archive after a few
seconds. So it seems odd to me to give up on the standby just because it
finds a WAL file whose size is short. I believe warm standby
(+ pg_standby) is based on the same idea.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?
> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that is genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

But isn't this something every current PITR archive already "works
around"? Everybody doing PITR archiving already knows the importance of
making the *appearance* of the WAL filename in the archive atomic.
Don't the docs warn about plain cp not being atomic and allowing "short"
files to appear in the archive? I'm not sure why "streaming recovery"
suddenly changes the requirements...

a.

--
Aidan Van Dyk                    Create like a god,
aidan@highrise.ca                command like a king,
http://www.highrise.ca/          work like a slave.
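[Editorial note: the "atomic appearance" technique Aidan refers to is the standard copy-then-rename idiom: copy the segment under a temporary name on the same filesystem, then rename it into place, since rename() within one filesystem is atomic. The script below is a hypothetical sketch — the function name and argument layout are assumptions, not something from the thread.]

```shell
# archive_wal: hypothetical archive_command helper. Copies the finished
# segment $1 into archive directory $2 under final name $3, making its
# appearance atomic: the file is renamed into place only after the copy
# has fully completed, so readers never observe a "short" file under
# the real WAL segment name.
archive_wal() {
    src="$1"; archive_dir="$2"; name="$3"
    cp "$src" "$archive_dir/$name.tmp" || return 1
    # rename() within one filesystem is atomic
    mv "$archive_dir/$name.tmp" "$archive_dir/$name"
}
```

Wired up as, say, `archive_command = 'archive_wal.sh %p /mnt/archive %f'` (path hypothetical), an archive populated this way never exposes a partially copied segment to a plain-cp restore_command — which is Aidan's point that the requirement predates streaming replication.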
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote:
> * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:
>
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though OTOH
>> it will then fail to complain about a WAL file that is genuinely
>> truncated for some reason. I guess there's no way around that, even if
>> you have a script as restore_command that does the file size check, it
>> will have the same problem.
>
> But isn't this something every current PITR archive already "works
> around"? Everybody doing PITR archiving already knows the importance of
> making the *appearance* of the WAL filename in the archive atomic.

Well, pg_standby does defend against that, but you don't use pg_standby
with the built-in standby mode anymore. It would be reasonable to have
the same level of defenses built-in. It's essentially a one-line change,
and it saves a lot of trouble and the risk of subtle misconfiguration
for admins.

> Don't the docs warn about plain cp not being atomic and allowing "short"
> files to appear in the archive?

Hmm, I don't see anything about that at a quick glance. Besides, normal
PITR doesn't have a problem with that, because it will stop when it
reaches the end of archived WAL anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > As I pointed out previously, the standby might restore a partially-filled
> > WAL file that is being archived by the primary, and cause a FATAL error.
> > And this happened in my box when I was testing the SR.
> >
> > sby [20088] FATAL: archive file "000000010000000000000087" has
> > wrong size: 14139392 instead of 16777216
> > sby [20076] LOG: startup process (PID 20088) exited with exit code 1
> > sby [20076] LOG: terminating any other active server processes
> > act [18164] LOG: received immediate shutdown request
> >
> > If the startup process is in standby mode, I think that it should retry
> > starting replication instead of emitting an error when it finds a
> > partially-filled file in the archive. Then if the replication has been
> > terminated, it has only to restore the archived file again. Thought?
>
> Hmm, so after running restore_command, check the file size and if it's
> too short, treat it the same as if restore_command returned non-zero?
> And it will be retried on the next iteration. Works for me, though OTOH
> it will then fail to complain about a WAL file that is genuinely
> truncated for some reason. I guess there's no way around that, even if
> you have a script as restore_command that does the file size check, it
> will have the same problem.

Are we trying to re-invent pg_standby here?

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>> Hmm, so after running restore_command, check the file size and if it's
>> too short, treat it the same as if restore_command returned non-zero?
>> And it will be retried on the next iteration. Works for me, though OTOH
>> it will then fail to complain about a WAL file that is genuinely
>> truncated for some reason. I guess there's no way around that, even if
>> you have a script as restore_command that does the file size check, it
>> will have the same problem.
>
> Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >> Hmm, so after running restore_command, check the file size and if it's
> >> too short, treat it the same as if restore_command returned non-zero?
> >> And it will be retried on the next iteration. Works for me, though OTOH
> >> it will then fail to complain about a WAL file that is genuinely
> >> truncated for some reason. I guess there's no way around that, even if
> >> you have a script as restore_command that does the file size check, it
> >> will have the same problem.
> >
> > Are we trying to re-invent pg_standby here?
>
> That's not the goal, but we seem to need some of the same functionality
> in the backend now.

I think you need to say why...

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
>>>> Hmm, so after running restore_command, check the file size and if it's
>>>> too short, treat it the same as if restore_command returned non-zero?
>>>> And it will be retried on the next iteration. Works for me, though OTOH
>>>> it will then fail to complain about a WAL file that is genuinely
>>>> truncated for some reason. I guess there's no way around that, even if
>>>> you have a script as restore_command that does the file size check, it
>>>> will have the same problem.
>>> Are we trying to re-invent pg_standby here?
>> That's not the goal, but we seem to need some of the same functionality
>> in the backend now.
>
> I think you need to say why...

See the quoted paragraph above: we should check the file size so that we
will not fail if the WAL file is just being copied into the archive
directory.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 14:44 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
> >>>> Hmm, so after running restore_command, check the file size and if it's
> >>>> too short, treat it the same as if restore_command returned non-zero?
> >>>> And it will be retried on the next iteration. Works for me, though OTOH
> >>>> it will then fail to complain about a WAL file that is genuinely
> >>>> truncated for some reason. I guess there's no way around that, even if
> >>>> you have a script as restore_command that does the file size check, it
> >>>> will have the same problem.
> >>> Are we trying to re-invent pg_standby here?
> >> That's not the goal, but we seem to need some of the same functionality
> >> in the backend now.
> >
> > I think you need to say why...
>
> See the quoted paragraph above. We should check the file size, so that
> we will not fail if the WAL file is just being copied into the archive
> directory.

We can read, but that's not an explanation. By giving terse answers in
that way you are giving the impression that you don't want discussion on
these points.

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.

pg_standby cannot be used with streaming replication.

I guess your next question is: why not?

The startup process alternates between streaming, and restoring files
from the archive using restore_command. It will progress using streaming
as long as it can, but if the connection is lost, it will try to poll the
archive until the connection is established again. The startup process
expects the restore_command to try to restore the file and fail if it's
not found. If the restore_command goes into sleep, waiting for the file
to arrive, that will defeat the retry logic in the server because the
startup process won't get control again to retry establishing the
connection.

That's the essence of my proposal here:
http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
which is what has now been implemented.

To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that would be desirable anyway. It's easier for the
admin to set up using simple commands like 'cp' or 'scp' than to require
him/her to write scripts that handle the sleeping and retry logic.

The real problem we have right now is missing documentation. It's
starting to hurt us more and more every day, as more people start to
test this, as shown by this thread and some other recent posts.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Dimitri Fontaine
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> If you were running pg_standby as the restore_command then this error
> wouldn't happen. So you need to explain why running pg_standby cannot
> solve your problem and why we must fix it by replicating code that has
> previously existed elsewhere.

Let me try.

pg_standby will not let the server get back to streaming replication
mode once it's done driving the replay of all the WAL files available in
the archive; instead, it will have the server sit there waiting for the
next file.

The way it is implemented now, the server switches back and forth
between replaying from the archive and streaming from the master. So we
want the server to restore from the archive the same way pg_standby used
to, except that if the archive does not contain the next WAL file, we
want to go back to streaming. And archive reading will resume at the
next network glitch.

I think that's the reasoning; I hope it explains what you see happening.

--
dim
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 15:28 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code that has
> > previously existed elsewhere.
>
> pg_standby cannot be used with streaming replication.
> I guess your next question is: why not?
>
> The startup process alternates between streaming, and restoring files
> from the archive using restore_command. It will progress using streaming
> as long as it can, but if the connection is lost, it will try to poll the
> archive until the connection is established again. The startup process
> expects the restore_command to try to restore the file and fail if it's
> not found. If the restore_command goes into sleep, waiting for the file
> to arrive, that will defeat the retry logic in the server because the
> startup process won't get control again to retry establishing the
> connection.

Why does the startup process need to regain control? Why not just let it
sit and wait?

Have you considered that if someone does use pg_standby or a similar
script in the restore_command, the server will never regain control in
the way you hope? Would that cause a sporadic hang?

The overall design was previously that the solution implementor was in
charge of the archive, and only they knew its characteristics. It seems
strange that we will be forced to explicitly ban people from using a
utility they were previously used to, and which is still included with
the distro, and then implement in the server the very things the utility
did. Only this time the solution implementor will not be in control.

I would not be against implementing all aspects of pg_standby into the
server. It would make life easier in some ways. I am against
implementing only a *few* of the aspects, because that leaves solution
architects in a difficult position to know what to do.

Please lay out some options here for discussion by the community. This
seems like a difficult area and not one to be patched up quickly.

> That's the essence of my proposal here:
> http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
> which is what has now been implemented.
>
> To support a restore_command that does the sleeping itself, like
> pg_standby, would require a major rearchitecting of the retry logic. And
> I don't see why that would be desirable anyway. It's easier for the
> admin to set up using simple commands like 'cp' or 'scp' than to require
> him/her to write scripts that handle the sleeping and retry logic.
>
> The real problem we have right now is missing documentation. It's
> starting to hurt us more and more every day, as more people start to
> test this, as shown by this thread and some other recent posts.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 14:41 +0100, Dimitri Fontaine wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > If you were running pg_standby as the restore_command then this error
> > wouldn't happen. So you need to explain why running pg_standby cannot
> > solve your problem and why we must fix it by replicating code that has
> > previously existed elsewhere.
>
> Let me try.
>
> pg_standby will not let the server get back to streaming replication
> mode once it's done driving the replay of all the WAL files available
> in the archive; instead, it will have the server sit there waiting for
> the next file.
>
> The way it is implemented now, the server switches back and forth
> between replaying from the archive and streaming from the master. So we
> want the server to restore from the archive the same way pg_standby used
> to, except that if the archive does not contain the next WAL file, we
> want to go back to streaming. And archive reading will resume at the
> next network glitch.
>
> I think that's the reasoning; I hope it explains what you see happening.

OK, thanks.

One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> One question then: how do we ensure that the archive does not grow too
> big? pg_standby cleans down the archive using %R. That function appears
> to not exist anymore.

You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be required. We should probably provide a
sample of that in the docs, or even a ready-made tool similar to
pg_standby but without the waiting.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
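[Editorial note: the "%R-style cleanup without the waiting" mentioned above can be sketched as a small helper run alongside restore_command. The script below is a hypothetical illustration — the function name and argument layout are assumptions, and "oldest segment still needed" is this sketch's reading of what pg_standby receives via its restartwalfile argument. Because WAL segment names are zero-padded, lexicographic order matches WAL order, so a plain string comparison suffices.]

```shell
# cleanup_archive: hypothetical archive-trimming helper. Removes every
# file in archive directory $1 whose name sorts strictly before $2, the
# oldest segment the standby still needs (the %R/%r value). Uses sort(1)
# for a portable lexicographic comparison.
cleanup_archive() {
    archive_dir="$1"; restart_file="$2"
    for f in "$archive_dir"/*; do
        [ -e "$f" ] || continue   # empty archive: glob left unmatched
        base=$(basename "$f")
        # which of the two names sorts first?
        older=$(printf '%s\n%s\n' "$base" "$restart_file" | sort | head -n 1)
        if [ "$older" = "$base" ] && [ "$base" != "$restart_file" ]; then
            rm "$f"               # strictly older than anything still needed
        fi
    done
}
```

A restore script could call this after each successful restore, keeping the archive bounded the way pg_standby's built-in cleanup does.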
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 08:29]:
> To support a restore_command that does the sleeping itself, like
> pg_standby, would require a major rearchitecting of the retry logic. And
> I don't see why that would be desirable anyway. It's easier for the
> admin to set up using simple commands like 'cp' or 'scp' than to require
> him/her to write scripts that handle the sleeping and retry logic.

But colour me confused: I'm still not understanding why this is any
different than with normal PITR recovery.

So even with a plain "cp" in your recovery command instead of a
sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
solutions out there), I'm not seeing how it's going to get "half files".
The only way I can see that is if you're out of disk space in your
recovering pg_xlog.

It's well known in PostgreSQL WAL archiving: you don't just "shove"
files into the archive, you make sure they appear there with the right
name atomically. And if the master is only running the archive command
on whole WAL files, I just don't understand this whole short-WAL
problem.

And don't try to tell me you're just "poaching" files from a running
cluster's pg_xlog directory, because I'm going to cry...

a.

--
Aidan Van Dyk                    Create like a god,
aidan@highrise.ca                command like a king,
http://www.highrise.ca/          work like a slave.
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 15:55 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > One question then: how do we ensure that the archive does not grow too
> > big? pg_standby cleans down the archive using %R. That function appears
> > to not exist anymore.
>
> You can still use %R. Of course, plain 'cp' won't know what to do with
> it, so a script will then be required. We should probably provide a
> sample of that in the docs, or even a ready-made tool similar to
> pg_standby but without the waiting.

So we still need a script but it can't be pg_standby? Hmmm, OK...

Might it not be simpler to add a parameter onto pg_standby? We send %s
to tell pg_standby the standby_mode of the server which is calling it,
so it can decide how to act in each case.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote:
> But colour me confused, I'm still not understanding why this is any
> different than with normal PITR recovery.
>
> So even with a plain "cp" in your recovery command instead of a
> sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
> solutions out there), I'm not seeing how it's going to get "half files".

If the file is just being copied to the archive when restore_command
('cp', say) is launched, it will copy a half file. That's not a problem
for PITR, because PITR will end at the end of valid WAL anyway, but
returning a half WAL file in standby mode is a problem.

> It's well known in PostgreSQL WAL archiving: you don't just "shove"
> files into the archive, you make sure they appear there with the right
> name atomically. And if the master is only running the archive command
> on whole WAL files, I just don't understand this whole short-WAL
> problem.

Yeah, if you're careful about that, then this change isn't required. But
pg_standby protects against that, so I think it'd be reasonable to have
the same level of protection built-in. It's not a lot of code.

We could well just document that you should do that, i.e. make sure the
file appears in the archive atomically with the right size.

> And don't try to tell me you're just "poaching" files from a running
> cluster's pg_xlog directory, because I'm going to cry...

No :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Might it not be simpler to add a parameter onto pg_standby?
> We send %s to tell pg_standby the standby_mode of the server which is
> calling it so it can decide how to act in each case.

That would work too, but it doesn't seem any simpler to me. On the
contrary.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Might it not be simpler to add a parameter onto pg_standby?
> > We send %s to tell pg_standby the standby_mode of the server which is
> > calling it so it can decide how to act in each case.
>
> That would work too, but it doesn't seem any simpler to me. On the contrary.

It would mean that pg_standby would act appropriately according to the
setting of standby_mode. So you wouldn't need multiple examples of use;
it would all just work whatever the setting of standby_mode. Nice simple
entry in the docs.

We've said we need a script, and pg_standby is it. Having a second
script, pg_standby2, rather than adding something to the existing tools,
just seems confusing and easier to get wrong.

I'm happy to do the patch if you like.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]:
> If the file is just being copied to the archive when restore_command
> ('cp', say) is launched, it will copy a half file. That's not a problem
> for PITR, because PITR will end at the end of valid WAL anyway, but
> returning a half WAL file in standby mode is a problem.

But it can be a problem: without the last WAL segment (or at least
enough of it) that the master switched and archived, you have no
guarantee of becoming consistent again (I'm thinking specifically of
recovering from a fresh backup).

> Yeah, if you're careful about that, then this change isn't required. But
> pg_standby protects against that, so I think it'd be reasonable to have
> the same level of protection built-in. It's not a lot of code.

This one check isn't, but what about the rest of the things pg_standby
does? How much functionality should we bring in? Ideally, "all" of it.

> We could well just document that you should do that, i.e. make sure the
> file appears in the archive atomically with the right size.

I have to admit, today was the first time I went and re-read the PITR
docs, and no, the docs don't seem to talk about that... Maybe it was
just plain obvious to me because the atomic appearance is something Unix
developers have always had to deal with, so it's ingrained in me. But
I'm *sure* I've seen it bandied around as common knowledge on the lists,
and it's one of the reasons we always see warnings about using rsync
instead of plain scp, etc.

So ya, we should probably mention that somewhere in the docs. Section
24.3.6, Caveats?

a.

--
Aidan Van Dyk                    Create like a god,
aidan@highrise.ca                command like a king,
http://www.highrise.ca/          work like a slave.
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Greg Smith
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Might it not be simpler to add a parameter onto pg_standby?
>> We send %s to tell pg_standby the standby_mode of the server which is
>> calling it so it can decide how to act in each case.
>
> That would work too, but it doesn't seem any simpler to me. On the contrary.

I don't know that the ideas are mutually exclusive. Should the server
internals do a basic check that the files they're about to process seem
sane before processing them? Yes. Should we provide a full-featured
program that knows how far back it can delete archives rather than
expecting people to write their own? Yes. And a modified pg_standby
seems the easiest way to do that, since it already knows how to do the
computations, among other benefits.

I already have a work-in-progress patch to pg_standby I'm racing to
finish here that cleans up some of the UI warts in the program,
including the scary-sounding and avoidable warnings Selena submitted a
patch to improve. I think I'm going to add this feature to it, too,
whether or not it's deemed necessary. I'm really tired of example setups
provided that say "just write a simple script using cp like <trivial
example>" and then supporting those non-production-quality messes in the
field. My scripts for this sort of thing have more error checking in
them than functional code. And pg_standby already has a healthy sense of
paranoia built into it.

--
Greg Smith    2ndQuadrant US    Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> Might it not be simpler to add a parameter onto pg_standby?
>>> We send %s to tell pg_standby the standby_mode of the server which is
>>> calling it so it can decide how to act in each case.
>> That would work too, but it doesn't seem any simpler to me. On the contrary.
>
> It would mean that pg_standby would act appropriately according to the
> setting of standby_mode. So you wouldn't need multiple examples of use;
> it would all just work whatever the setting of standby_mode. Nice simple
> entry in the docs.
>
> We've said we need a script, and pg_standby is it. Having a second
> script, pg_standby2, rather than adding something to the existing tools,
> just seems confusing and easier to get wrong.
>
> I'm happy to do the patch if you like.

Well, it doesn't really seem that useful to me, but I won't object if
you want to do it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Euler Taveira de Oliveira
Date:
Simon Riggs escreveu: > It would mean that pg_standby would act appropriately according to the > setting of standby_mode. So you wouldn't need multiple examples of use, > it would all just work whatever the setting of standby_mode. Nice simple > entry in the docs. > +1. I like the %s idea. IMHO fixing pg_standby for SR is a must-fix. I'm foreseeing a lot of users asking why pg_standby doesn't work in SR mode ... -- Euler Taveira de Oliveira http://www.timbira.com/
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote: > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]: > >> If the file is just being copied to the archive when restore_command >> ('cp', say) is launched, it will copy a half file. That's not a problem >> for PITR, because PITR will end at the end of valid WAL anyway, but >> returning a half WAL file in standby mode is a problem. > > But it can be a problem - without the last WAL (or at least enough of > it) the master switched and archived, you have no guarantee of having > being consistent again (I'm thinking specifically of recovering from a > fresh backup) You have to wait for the last WAL file required by the backup to be archived before starting recovery. Otherwise there's no guarantee anyway. >> We could well just document that you should do that, ie. make sure the >> file appears in the archive atomically with the right size. > > I have to admit, today was the first time I went and re-read the PITR > docs, and no, the docs don't seem to talk about that... Maybe it was > just plain obvious to me because it (the atomic appearance) is something > unix developers have always had to deal with, so it's ingrained in me. > But I'm *sure* that I've seen that bandied around as common knowledge on > the lists, and one of the reasons we always see warnings about using > rsync instead of plain SCP, etc. > > So ya, we should probably mention that somewhere in the docs. Section > 24.3.6. Caveats? -1. It isn't necessary for PITR. It's a new requirement for standby_mode='on', unless we add the file size check into the backend. I think we should add the file size check to the backend instead and save admins the headache. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
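The size check being proposed is simple to sketch in shell (the helper name here is invented for illustration, and 16777216 bytes is the default 16MB segment size; in the thread the check ultimately lands inside the backend rather than in a script):

```shell
# Illustrative sketch of the file-size check discussed above.  The
# function name is hypothetical; 16777216 bytes is the default segment size.
WAL_SEGMENT_SIZE=16777216

check_wal_complete() {
    # Succeed only if the file exists and is a full WAL segment.
    wal_path="$1"
    [ -f "$wal_path" ] || return 1
    [ "$(wc -c < "$wal_path")" -eq "$WAL_SEGMENT_SIZE" ]
}

# A restore_command wrapper could then refuse partial files, e.g.:
#   check_wal_complete "$ARCHIVE/$FILE" && cp "$ARCHIVE/$FILE" "$DEST"
# A non-zero exit simply makes the standby retry a few seconds later.
```

A non-zero exit from restore_command is harmless in standby mode, which is what makes this kind of check safe: the standby just tries again.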
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote: > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]: > >> Yeah, if you're careful about that, then this change isn't required. But >> pg_standby protects against that, so I think it'd be reasonable to have >> the same level of protection built-in. It's not a lot of code. > > This 1 check isn't, but what about the rest of the things pg_standby > does. How much functionality should we bring it? Ideally, "all" of it. Well, how about we bite the bullet then and add enough bells and whistles to the backend that pg_standby really isn't needed anymore, and remove it from contrib? Looking at the options to pg_standby, we're not missing much: > Options: > -c copies file from archive (default) > -l links into archive (leaves file in archive) Obsolete (link mode not supported anymore) > -d generate lots of debugging output (testing only) We have DEBUG statements in the server... > -k NUMFILESTOKEEP if RESTARTWALFILE not used, removes files prior to limit > (0 keeps all) This is dangerous, and obsoleted by the RESTARTWALFILE option (%r). > -r MAXRETRIES max number of times to retry, with progressive wait > (default=3) Frankly this seems pretty useless, but it would be easy to implement > -s SLEEPTIME seconds to wait between file checks (min=1, max=60, > default=5) The sleep time in the backend is currently hard-coded at 5 s. Should we make it configurable? > -t TRIGGERFILE defines a trigger file to initiate failover (no default) We have this in the backend already. > -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0) We don't have a timeout in the backend. Should we? (the timeout in pg_standby won't work, even if we add the option to use pg_standby with standby_mode='on' as Simon suggested) So the only major feature we're missing is the ability to clean up old files. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 12:04]: > > But it can be a problem - without the last WAL (or at least enough of > > it) the master switched and archived, you have no guarantee of having > > being consistent again (I'm thinking specifically of recovering from a > > fresh backup) > > You have to wait for the last WAL file required by the backup to be > archived before starting recovery. Otherwise there's no guarantee anyway. Right, but now define "wait for". If you pull only "half" the last WAL (because you've accepted that you *can* have short WAL files in the archive), you have the problem with PITR. Is it "wait for it to be in the archive", or "wait for it to be in the archive, and be sure that the contents satisfy some criteria"? I've always made my PITR such that "in the archive" (i.e. the first moment a recovery can see it) implies that it's bit-for-bit identical to the original (or at least as bit-for-bit as I can assume by checking various hashes I can afford to). I just assumed that was kind of common practice. I'm amazed that "partial WAL" files are ever available in anyone's archive, for anyone's restore command to actually pull. I find that scary, and sure, it probably won't regularly be noticed... But man, I'd hate the time I need that emergency PITR restore to be the one time when it needs that WAL, pulls it slightly before the copy has finished (i.e. the master is pushing the WAL over a WAN to a 2nd site), and have my restore complete consistently... a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > -1. it isn't necessary for PITR. It's a new requirement for > standby_mode='on', unless we add the file size check into the backend. I > think we should add the file size check to the backend instead and save > admins the headache. I think the file size check needs to be in the backend purely on safety grounds. Whether we make pg_standby useful for this context is a different discussion. regards, tom lane
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Aidan Van Dyk wrote: > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 12:04]: > >>> But it can be a problem - without the last WAL (or at least enough of >>> it) the master switched and archived, you have no guarantee of having >>> being consistent again (I'm thinking specifically of recovering from a >>> fresh backup) >> You have to wait for the last WAL file required by the backup to be >> archived before starting recovery. Otherwise there's no guarantee anyway. > > Right, but now define "wait for". As in don't start postmaster until the last WAL file needed by backup has been fully copied to the archive. > (because you've accepted that you *can* have short WAL files in the > archive) Only momentarily, while the copy is in progress. > I've always made my PITR such that "in the archive" (i.e. the first > moment a recovery can see it) implies that it's bit-for-bit identical to > the original (or at least as bit-for-bit I can assume by checking > various hashes I can afford to). I just assumed that was kind of common > practice. It's certainly good practice, agreed, but hasn't been absolutely required. > I'm amazed that "partial WAL" files are every available in anyones > archive, for anyone's restore command to actually pull. I find that > scarry, and sure, probably won't regularly be noticed... But man, I'ld > hate the time I need that emergency PITR restore to be the one time when > it needs that WAL, pulls it slightly before the copy has finished (i.e. > the master is pushing the WAL over a WAN to a 2nd site), and have my > restore complete consistently... It's not as dramatic as you make it sound. We're only talking about the last WAL file, and only when it's just being copied to the archive. If you have a archive_command like 'cp', and you look at the archive at the same millisecond that 'cp' runs, then yes you will see that the latest WAL file in the archive is only partially copied. 
It's not a problem for robustness; if you had looked one millisecond earlier you would not have seen the file there at all. Windows 'copy' command preallocates the whole file, which poses a different problem: if you look at the file while it's being copied, the file has the right length, but isn't in fact fully copied yet. I think 'rsync' has the same problem. To avoid that issue, you have to use something like copy+rename to make it atomic. There isn't much we can do in the server (or in pg_standby) to work around that, because there's no way to distinguish a file that's being copied from a fully-copied corrupt file. We do advise to set up an archive_command that doesn't overwrite existing files. That together with a partial WAL segment can cause a problem: if archive_command crashes while it's writing the file, leaving a partial file in the archive, the subsequent run of archive_command won't overwrite it and will get stuck trying. However, there's a small window for that even if you put the file into the archive atomically: if you crash just after fully copying the file, but before the .done file is created, upon restart the server will also try to copy the file to archive, find that it already exists, and fail. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
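The copy+rename idiom mentioned above can be sketched as follows (function and path names are illustrative, not part of PostgreSQL). Because rename() is atomic within one filesystem, a reader of the archive sees either no file or a complete one, never a half-copied segment:

```shell
# Sketch of an atomic archive step: copy to a temporary name inside the
# archive directory, then rename into place.  Names are illustrative.
archive_wal() {
    src="$1"     # %p: segment file to archive
    dest="$2"    # final path inside the archive
    [ -f "$dest" ] && return 1          # never overwrite an existing file
    cp "$src" "$dest.tmp" || return 1   # a partial copy is only ever *.tmp
    mv "$dest.tmp" "$dest"              # atomic on the same filesystem
}

# Hypothetical use: archive_command = 'archive_wal.sh %p /mnt/archive/%f'
```

The "never overwrite" guard matches the advice in the message above; note it still leaves the small window described there if a crash leaves a stale file behind the final name.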
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 19:29 +0200, Heikki Linnakangas wrote: > Aidan Van Dyk wrote: > > * Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 09:17]: > > > >> Yeah, if you're careful about that, then this change isn't required. But > >> pg_standby protects against that, so I think it'd be reasonable to have > >> the same level of protection built-in. It's not a lot of code. > > > > This 1 check isn't, but what about the rest of the things pg_standby > > does. How much functionality should we bring it? Ideally, "all" of it. > > Well, how about we bite the bullet then and add enough bells and > whistles to the backend that pg_standby really isn't needed anymore, and > remove it from contrib? > > Looking at the options to pg_standby, we're not missing much: Your attitude to this makes me laugh, and cry. Removing pg_standby can be done and if you manage it well, I would be happy. I laugh because it seems like you think I have some ego tied up in it. It's hardly my finest hour of coding, especially since I've not had the ability to improve it much before now. Not missing *much*?! I cry because you're so blasé about missing one or two essential features that took years to put in place. Keeping pg_standby at least guarantees we can do something about it when we discover you didn't do a good enough job removing it. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > -1. it isn't necessary for PITR. It's a new requirement for > > standby_mode='on', unless we add the file size check into the backend. I > > think we should add the file size check to the backend instead and save > > admins the headache. > > I think the file size check needs to be in the backend purely on safety > grounds. Whether we make pg_standby useful for this context is a > different discussion. Happy with that. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
"Kevin Grittner"
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think 'rsync' has the same problem. There is a switch you can use to create the problem under rsync, but by default rsync copies to a temporary file name and moves the completed file to the target name. -Kevin
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Garick Hamlin
Date:
On Thu, Feb 11, 2010 at 01:22:44PM -0500, Kevin Grittner wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > > > I think 'rsync' has the same problem. > > There is a switch you can use to create the problem under rsync, but > by default rsync copies to a temporary file name and moves the > completed file to the target name. > > -Kevin I don't use PITR, so I don't know any of the well-understood facts about PITR with postgres, but my understanding with rsync is: it doesn't fsync data before the rename. It's something like open / write / close / rename, which could lead to zero-length files on some filesystems. (Are there other anomalies to worry about here?) Would that be a concern? Garick
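The fsync concern raised here fits into the same copy-then-rename idiom by forcing the data to disk before the rename. A sketch (`conv=fsync` is a GNU coreutils dd extension, and the function name is illustrative):

```shell
# Copy with an fsync before the rename, so a crash cannot leave a
# zero-length (or torn) file behind the final name.
safe_copy() {
    src="$1"
    dest="$2"
    # conv=fsync makes GNU dd fsync the output file before exiting
    dd if="$src" of="$dest.tmp" conv=fsync 2>/dev/null || return 1
    mv "$dest.tmp" "$dest"
}
```

Whether the directory entry for the rename itself also needs an fsync depends on the filesystem; that part is outside what a plain shell one-liner can guarantee.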
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Simon Riggs wrote: >> Might it not be simpler to add a parameter onto pg_standby? >> We send %s to tell pg_standby the standby_mode of the server which is >> calling it so it can decide how to act in each case. > > That would work too, but it doesn't seem any simpler to me. On the contrary. Agreed. There could be three kinds of SR configurations. Let's think about them separately. (1) SR without restore_command (2) SR with 'cp' (3) SR with pg_standby (1) is the straightforward configuration. In this case the standby replays only the WAL files in the pg_xlog directory, and starts SR when it has found an invalid record or can find no more WAL files. Then if SR is terminated for some reason, the standby periodically tries to connect to the primary and start SR again. If you choose this, you don't need to care about the problem discussed on this thread. In the (2) case the standby replays the WAL files not only in pg_xlog but also in the archive, and starts SR when it has found an invalid record or can find no more WAL files. If the archive is shared between the primary and the standby, the standby might restore a partial WAL file that is being archived (copied) by the primary. This can happen because 'cp' is not an atomic operation. Currently when the standby finds a WAL file whose size is less than 16MB, it emits a FATAL error. This is the problem that I presented upthread. That is undesirable behavior, so I proposed to just treat that case the same as if no more WAL files were found. Then the standby can start SR instead of emitting the FATAL error. (2) is a useful configuration, as described in Heikki's commit message. http://archives.postgresql.org/pgsql-committers/2010-01/msg00395.php (3) was an unexpected configuration (at least for me). This would work fine as *file-based* log shipping but not SR.
Since pg_standby doesn't return when no more WAL files are found in the archive (i.e., it waits until the next complete WAL file is available), SR will never start. OTOH, since pg_standby treats a partial file as "nonexistent", the problem discussed on this thread doesn't happen. Questions: (A) Is my proposal for (2) reasonable? For me, yes. (B) Should we allow (3) to work as "streaming replication"? That is, should we create a new mode that makes pg_standby return as soon as it doesn't find a complete WAL file in the archive? I agree with Heikki, i.e., I don't think it's worth doing. Though pg_standby already has the capability to remove old WAL files, we would still need a cron job that removes them periodically, because pg_standby is not executed while SR is running normally. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
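The three configurations above correspond roughly to these recovery.conf sketches (host names and archive paths are placeholders for illustration):

```
# (1) SR without restore_command: replay pg_xlog, then stream
standby_mode = 'on'
primary_conninfo = 'host=primary port=5432'

# (2) SR with 'cp': also replay from a (possibly shared) archive
standby_mode = 'on'
primary_conninfo = 'host=primary port=5432'
restore_command = 'cp /mnt/archive/%f %p'

# (3) file-based log shipping with pg_standby (no SR; pg_standby blocks
#     waiting for the next complete segment, so streaming never starts)
restore_command = 'pg_standby /mnt/archive %f %p %r'
```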
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>> -1. it isn't necessary for PITR. It's a new requirement for >>> standby_mode='on', unless we add the file size check into the backend. I >>> think we should add the file size check to the backend instead and save >>> admins the headache. >> I think the file size check needs to be in the backend purely on safety >> grounds. Whether we make pg_standby useful for this context is a >> different discussion. > > Happy with that. Committed such a check now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Fri, 2010-02-12 at 14:38 +0900, Fujii Masao wrote: > On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: > > Simon Riggs wrote: > >> Might it not be simpler to add a parameter onto pg_standby? > >> We send %s to tell pg_standby the standby_mode of the server which is > >> calling it so it can decide how to act in each case. > > > > That would work too, but it doesn't seem any simpler to me. On the contrary. > > Agreed. > > There could be three kinds of SR configurations. Let's think of them separately. > > (1) SR without restore_command > (2) SR with 'cp' > (3) SR with pg_standby Thanks for the explanation. > (1) is the straightforward configuration. In this case the standby replays only > the WAL files in pg_xlog directory, and starts SR when it has found the invalid > record or been able to find no more WAL file. Then if SR is terminated for some > reasons, the standby would periodically try to connect to the primary and start > SR again. If you choose this, you don't need to care about the problem discussed > on the thread. > > In the (2) case the standby replays the WAL files in not only pg_xlog but also > the archive, and starts SR when it has found the invalid record or been able to > find no more WAL file. If the archive is shared between the primary and the > standby, the standby might restore the partial WAL file being archived (copied) > by the primary. This could happen because 'cp' is not an atomic operation. > > Currently when the standby finds the WAL file whose file size is less than 16MB, > it emits the FATAL error. This is the problem that I presented upthread. That is > undesirable behavior, so I proposed to just treat that case the same as if no > more WAL file is found. If so, the standby can start SR instead of emitting the > FATAL error. (2) is useful configuration as described in Heikki's > commig message. 
> http://archives.postgresql.org/pgsql-committers/2010-01/msg00395.php > (3) was unexpected configuration (at least for me). This would work fine as a > *file-based* log shipping but not SR. Since pg_standby doesn't return when no > more WAL file is found in the archive (i.e., it waits until the next complete > WAL file is available), SR will never start. OTOH, since pg_standby treats the > partial file as "nonexistence", the problem discussed on this thread doesn't > happen. When we refer to "pg_standby" we mean any script. pg_standby is just a reference implementation of a useful script. The discussion shouldn't really focus on pg_standby, nor should we think of it as a different approach. My original question was whether we are seeking to remove pg_standby and, if so, have we implemented all of the technical features that pg_standby provides? Worryingly the answer seems to be Yes and No. I don't care if we get rid of pg_standby as long as we retain all the things it does. *Losing* features is not acceptable. > Questions: > (A) Is my proposal for (2) reasonable? For me, Yes. > (B) Should we allow (3) to work as "streaming replication"? In fact, we should > create the new mode that makes pg_standby return as soon as it doesn't find > a complete WAL file in the archive? I agree with Heikki, i.e., don't think > it's worth doing. Though pg_standby already has the capability to remove the > old WAL files, we would still need the cron job that removes them > periodically > because pg_standby is not executed during SR is running normally. Yes, I realised that myself overnight and was going to raise this point with you today. In 8.4 it is pg_standby that was responsible for clearing down the archive, which is why I suggested using pg_standby for that again. I agree that will not work. The important thing is not pg_standby but that we have a valid mechanism for clearing down the archive. 
If (2) is a fully supported replication method then we cannot rely on the existence of an external cron job to clear down the archive. Most importantly, that cron job would not know the point up to which to clear down the archive, the information given to pg_standby by %r. So I suggest that you have a new action that gets called after every checkpoint to clear down the archive. It will remove all files from the archive prior to %r. We can implement that as a sequence of unlink()s from within the server, or we can just call a script to do it. I prefer the latter approach. However we do it, we need something initiated by the server to maintain the archive and stop it from overflowing. -- Simon Riggs www.2ndQuadrant.com
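The proposed cleanup action can be sketched as a small script driven by %r. WAL segment names are fixed-width hex, so lexicographic order matches the order in which the files are needed (function name and paths below are illustrative):

```shell
# Remove every archived segment that sorts strictly before the %r file,
# i.e. everything no longer needed to restart from the last restartpoint.
cleanup_archive() {
    archive_dir="$1"
    restart_file="$2"   # %r: name of the oldest file still required
    for path in "$archive_dir"/*; do
        [ -f "$path" ] || continue
        name=$(basename "$path")
        # expr's \< compares lexicographically, which for fixed-width
        # WAL file names is the same as their required order
        if expr "$name" \< "$restart_file" > /dev/null; then
            rm -f "$path"
        fi
    done
}

# Hypothetical use: archive_cleanup_command = 'cleanup_archive.sh /mnt/archive %r'
```

As noted later in the thread, this only works safely when a single standby reads the archive; with several standbys sharing one archive, each would need its own cut-off.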
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > In 8.4 it is pg_standby that was responsible for clearing down the > archive, which is why I suggested using pg_standby for that again. I > agree that will not work. The important thing is not pg_standby but that > we have a valid mechanism for clearing down the archive. Good point. When streaming from the master, the standby doesn't call restore_command, and restore_command doesn't get a chance to clean up old files. > So I suggest that you have a new action that gets called after every > checkpoint to clear down the archive. It will remove all files from the > archive prior to %r. We can implement that as a sequence of unlink()s > from within the server, or we can just call a script to do it. I prefer > the latter approach. However we do it, we need something initiated by > the server to maintain the archive and stop it from overflowing. +1 -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Fri, 2010-02-12 at 12:54 +0000, Simon Riggs wrote: > So I suggest that you have a new action that gets called after every > checkpoint to clear down the archive. It will remove all files from the > archive prior to %r. We can implement that as a sequence of unlink()s > from within the server, or we can just call a script to do it. I prefer > the latter approach. However we do it, we need something initiated by > the server to maintain the archive and stop it from overflowing. Attached patch implements pg_standby for use as an archive_cleanup_command, reusing existing code with new -a option. e.g. archive_cleanup_command = 'pg_standby -a -d /mnt/server/archiverdir %r' Happy to add the archive_cleanup_command into main server as well, if you like. Won't take long. -- Simon Riggs www.2ndQuadrant.com
Attachment
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: >> So I suggest that you have a new action that gets called after every >> checkpoint to clear down the archive. It will remove all files from the >> archive prior to %r. We can implement that as a sequence of unlink()s >> from within the server, or we can just call a script to do it. I prefer >> the latter approach. However we do it, we need something initiated by >> the server to maintain the archive and stop it from overflowing. > > +1 If we leave executing the remove_command to the bgwriter, the restartpoint might unfortunately not happen for a long time. To prevent that situation, the archiver should execute the command, I think. Thoughts? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote: > On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >>> So I suggest that you have a new action that gets called after every >>> checkpoint to clear down the archive. It will remove all files from the >>> archive prior to %r. We can implement that as a sequence of unlink()s >>> from within the server, or we can just call a script to do it. I prefer >>> the latter approach. However we do it, we need something initiated by >>> the server to maintain the archive and stop it from overflowing. >> +1 > > If we leave executing the remove_command to the bgwriter, the restartpoint > might not happen unfortunately for a long time. Are you thinking of a scenario where remove_command gets stuck, and prevents bgwriter from performing restartpoints while it's stuck? You have trouble if restore_command gets stuck like that as well, so I think we can require that the remove_command returns in a reasonable period of time, ie. in a few minutes. > To prevent that situation, the > archiver should execute the command, I think. Thought? The archiver isn't running in standby, so that's not going to work. And it's not connected to shared memory either, so it doesn't know what the latest restartpoint is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Dimitri Fontaine
Date:
Simon Riggs <simon@2ndQuadrant.com> writes: > Attached patch implements pg_standby for use as an > archive_cleanup_command, reusing existing code with new -a option. > > Happy to add the archive_cleanup_command into main server as well, if > you like. Won't take long. Would it be possible to have the server do the cleanup (and the cp for that matter) on its own or use a script? Either have archive_cleanup = on with a commented-out (empty) archive_cleanup_command, and the same for the restore_command, or a special magic value, like 'postgresql', to activate the internal one. This way we have both a zero-config working default setup and the flexibility to adapt to any archiving solution our user might already be using. A default archive_command would be nice too. Like you give it a directory name or an rsync path and it does the basic right thing. I'm not sure my detailed approach is the right one, but the bird's-eye view is to have the simple case really simple to set up, and the complex cases still possible. Default included archive, restore and cleanup commands would be awesome. Oh, and the basic simple case actually is with a remote standby. Without NFS or the like, there's no shared mount point for archiving. Regards, -- dim
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Greg Stark
Date:
I don't like having the server doing the cleanup because it doesn't necessarily have the whole picture. It doesn't know if it is the only replica reading these log files or if the site policy is to keep them for disaster recovery purposes.

I like having this as an external command though. Or alternately, now that we have read-only replicas, you could implement this by polling the slaves and asking them what log position they are up to.

greg

On 12 Feb 2010 17:25, "Dimitri Fontaine" <dfontaine@hi-media.com> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Attached patch implements pg_standby for use as an
>> archive_cleanup_command, reusing existing cod...
>> Happy to add the archive_cleanup_command into main server as well, if
>> you like. Won't take long....
> Would it be possible to have the server do the cleanup (and the cp for
> that matter) on its own or use a script?
>
> Either have archive_cleanup = on and either an commented out (empty)
> archive_cleanup_command and the same for the restore_command, or a
> special magic value, like 'postgresql' to activate the internal one.
>
> This way we have both a zero config working default setup and the
> flexibility to adapt to any archiving solution our user might already be
> using.
>
> A default archive_command would be nice too. Like you give it a
> directory name or a rsync path and it does the basic right thing.
>
> I'm not sure my detailed approach is the right one, but the birdview is
> to have the simple case really simple to setup, and the complex cases
> still possible. Default included archive, restore and cleanup commands
> would be awesome.
>
> Oh, and the basic simple case actually is with a remote standby.
> Without NFS or the like, no shared mount point for archiving.
>
> Regards,
> --
> dim
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Sat, Feb 13, 2010 at 1:10 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Are you thinking of a scenario where remove_command gets stuck, and > prevents bgwriter from performing restartpoints while it's stuck? Yes. If the archive is on a remote server and a network outage happens, remove_command might get stuck, I'm afraid. > You > have trouble if restore_command gets stuck like that as well, so I think > we can require that the remove_command returns in a reasonable period of > time, ie. in a few minutes. Oh, you are right! BTW, we need to note that the remove_command approach would be useless if one archive is shared by multiple standbys. One standby might wrongly remove an archived WAL file that is still required by another standby. In that case, we need a job script that determines which archived WAL files are required by no standby, and removes them. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Fri, Feb 12, 2010 at 2:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > So the only major feature we're missing is the ability to clean up old > files. I found another missing feature in the new file-based log shipping (i.e., standby_mode is enabled and 'cp' is used as restore_command). After the trigger file is found, the startup process with pg_standby tries to replay all of the WAL files in both pg_xlog and the archive. So, when the primary fails, if the latest WAL file in pg_xlog of the primary can be read, we can prevent data loss by copying it to pg_xlog of the standby before creating the trigger file. On the other hand, the startup process in standby mode doesn't replay the WAL files in pg_xlog after the trigger file is found. So failover always causes data loss even if the latest WAL file can be read from the primary. And if the latest WAL file is copied to the archive instead, it can be replayed, but a PANIC error would happen because it's not filled. Should we remove this restriction? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> I found another missing feature in new file-based log shipping (i.e.,
> standby_mode is enabled and 'cp' is used as restore_command).
>
> After the trigger file is found, the startup process with pg_standby
> tries to replay all of the WAL files in both pg_xlog and the archive.
> So, when the primary fails, if the latest WAL file in pg_xlog of the
> primary can be read, we can prevent the data loss by copying it to
> pg_xlog of the standby before creating the trigger file.
>
> On the other hand, the startup process with standby mode doesn't
> replay the WAL files in pg_xlog after the trigger file is found. So
> failover always causes the data loss even if the latest WAL file can
> be read from the primary. And if the latest WAL file is copied to the
> archive instead, it can be replayed but a PANIC error would happen
> because it's not filled.
>
> We should remove this restriction?

Looking into this, I realized that we have a bigger problem related to
this. Although streaming replication stores the streamed WAL files in
pg_xlog, so that they can be re-replayed after a standby restart without
connecting to the master, we don't try to replay those either. So if you
restart the standby, it will fail to start up if the WAL it needs can't
be found in the archive or by connecting to the master. That must be
fixed. I'd imagine that the ability to restore WAL files manually copied
to pg_xlog will fall out of that fix too.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Wed, 2010-03-17 at 12:35 +0200, Heikki Linnakangas wrote:
> Looking into this, I realized that we have a bigger problem...

A lot of this would be easier if you do the docs first, then work
through the problems. The new system is more complex, since it has two
modes rather than one and also multiple processes and a live connection.
The number of failure cases must be higher than previously. Documenting
how it is supposed to work in the event of failure will help everyone
check those and comment on them.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Wed, Mar 17, 2010 at 7:35 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Fujii Masao wrote:
>> I found another missing feature in new file-based log shipping (i.e.,
>> standby_mode is enabled and 'cp' is used as restore_command).
>>
>> After the trigger file is found, the startup process with pg_standby
>> tries to replay all of the WAL files in both pg_xlog and the archive.
>> So, when the primary fails, if the latest WAL file in pg_xlog of the
>> primary can be read, we can prevent the data loss by copying it to
>> pg_xlog of the standby before creating the trigger file.
>>
>> On the other hand, the startup process with standby mode doesn't
>> replay the WAL files in pg_xlog after the trigger file is found. So
>> failover always causes the data loss even if the latest WAL file can
>> be read from the primary. And if the latest WAL file is copied to the
>> archive instead, it can be replayed but a PANIC error would happen
>> because it's not filled.
>>
>> We should remove this restriction?
>
> Looking into this, I realized that we have a bigger problem related to
> this. Although streaming replication stores the streamed WAL files in
> pg_xlog, so that they can be re-replayed after a standby restart without
> connecting to the master, we don't try to replay those either. So if you
> restart standby, it will fail to start up if the WAL it needs can't be
> found in archive or by connecting to the master. That must be fixed.

I agree that this is a bigger problem. Since the standby always starts
walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
to receive the WAL files following the REDO starting point even if they
have already been in pg_xlog. IOW, the same WAL files might be shipped
from the primary to the standby many times. This behavior is unsmart,
and should be addressed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
> I agree that this is a bigger problem. Since the standby always starts
> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
> to receive the WAL files following the REDO starting point even if they
> have already been in pg_xlog. IOW, the same WAL files might be shipped
> from the primary to the standby many times. This behavior is unsmart,
> and should be addressed.

We might also have written half a file many times. The files in pg_xlog
are suspect whereas the files in the archive are not. If we have both we
should prefer the archive.

--
Simon Riggs
www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
>> I agree that this is a bigger problem. Since the standby always starts
>> walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
>> to receive the WAL files following the REDO starting point even if they
>> have already been in pg_xlog. IOW, the same WAL files might be shipped
>> from the primary to the standby many times. This behavior is unsmart,
>> and should be addressed.
>
> We might also have written half a file many times. The files in pg_xlog
> are suspect whereas the files in the archive are not. If we have both we
> should prefer the archive.

Yep. Here's a patch I've been playing with. The idea is that in standby
mode, the server keeps trying to make progress in the recovery by:

a) restoring files from archive
b) replaying files from pg_xlog
c) streaming from master

When recovery reaches an invalid WAL record, typically caused by a
half-written WAL file, it closes the file and moves to the next source.
If an error is found in a file restored from archive or in a portion
just streamed from master, however, a PANIC is thrown, because errors
are not expected in the archive or in the master.

When a file is streamed from master, it's left in pg_xlog, so it's found
there after a standby restart, and recovery can progress to the same
point as before the restart. It also means that you can copy partial WAL
files to pg_xlog at any time and have them replayed in a few seconds.

The code structure is a bit spaghetti-like, I'm afraid. Any suggestions
on how to improve that are welcome.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 445,464 ****
  static uint32 openLogSeg = 0;
  static uint32 openLogOff = 0;

  /*
   * These variables are used similarly to the ones above, but for reading
   * the XLOG. Note, however, that readOff generally represents the offset
   * of the page just read, not the seek position of the FD itself, which
   * will be just past that page. readLen indicates how much of the current
!  * page has been read into readBuf.
   */
  static int	readFile = -1;
  static uint32 readId = 0;
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  static uint32 readLen = 0;

! /* Is the currently open segment being streamed from primary? */
! static bool readStreamed = false;

  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
--- 445,477 ----
  static uint32 openLogOff = 0;

  /*
+  * Codes indicating where we got a WAL file from during recovery, or where
+  * to attempt to get one.
+  */
+ #define XLOG_FROM_ARCHIVE	(1<<0)	/* Restored using restore_command */
+ #define XLOG_FROM_PG_XLOG	(1<<1)	/* Existing file in pg_xlog */
+ #define XLOG_FROM_STREAM	(1<<2)	/* Streamed from master */
+
+ /*
   * These variables are used similarly to the ones above, but for reading
   * the XLOG. Note, however, that readOff generally represents the offset
   * of the page just read, not the seek position of the FD itself, which
   * will be just past that page. readLen indicates how much of the current
!  * page has been read into readBuf, and readSource indicates where we got
!  * the currently open file from.
   */
  static int	readFile = -1;
  static uint32 readId = 0;
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  static uint32 readLen = 0;
+ static int	readSource = 0;		/* XLOG_FROM_* code */

! /*
!  * Keeps track of which sources we've tried to read the current WAL
!  * record from and failed.
!  */
! static int	failedSources = 0;

  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***************
*** 512,520 ****
  static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  					   bool find_free, int *max_advance,
  					   bool use_lock);
  static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
! 			 bool fromArchive, bool notexistOk);
  static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
! 				   bool fromArchive);
  static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
  			 bool randAccess);
  static void XLogFileClose(void);
--- 525,533 ----
  					   bool find_free, int *max_advance,
  					   bool use_lock);
  static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
! 			 int source, bool notexistOk);
  static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
! 				   int sources);
  static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
  			 bool randAccess);
  static void XLogFileClose(void);
***************
*** 2567,2573 ****
  XLogFileOpen(uint32 log, uint32 seg)
  */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
! 			 bool fromArchive, bool notfoundOk)
  {
  	char		xlogfname[MAXFNAMELEN];
  	char		activitymsg[MAXFNAMELEN + 16];
--- 2580,2586 ----
  */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
! 			 int source, bool notfoundOk)
  {
  	char		xlogfname[MAXFNAMELEN];
  	char		activitymsg[MAXFNAMELEN + 16];
***************
*** 2576,2598 ****
  	XLogFileName(xlogfname, tli, log, seg);

! 	if (fromArchive)
  	{
! 		/* Report recovery progress in PS display */
! 		snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
! 				 xlogfname);
! 		set_ps_display(activitymsg, false);

! 		restoredFromArchive = RestoreArchivedFile(path, xlogfname,
! 												  "RECOVERYXLOG",
! 												  XLogSegSize);
! 		if (!restoredFromArchive)
! 			return -1;
! 	}
! 	else
! 	{
! 		XLogFilePath(path, tli, log, seg);
! 		restoredFromArchive = false;
  	}

  	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
--- 2589,2616 ----
  	XLogFileName(xlogfname, tli, log, seg);

! 	switch (source)
  	{
! 		case XLOG_FROM_ARCHIVE:
! 			/* Report recovery progress in PS display */
! 			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
! 					 xlogfname);
! 			set_ps_display(activitymsg, false);

! 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
! 													  "RECOVERYXLOG",
! 													  XLogSegSize);
! 			if (!restoredFromArchive)
! 				return -1;
! 			break;
!
! 		case XLOG_FROM_PG_XLOG:
! 			XLogFilePath(path, tli, log, seg);
! 			restoredFromArchive = false;
! 			break;
!
! 		default:
! 			elog(ERROR, "invalid XLogFileRead source %d", source);
  	}

  	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
***************
*** 2606,2611 ****
--- 2624,2631 ----
  				 xlogfname);
  		set_ps_display(activitymsg, false);

+ 		readSource = source;
+
  		return fd;
  	}
  	if (errno != ENOENT || !notfoundOk)	/* unexpected failure? */
***************
*** 2624,2630 ****
   * searched in pg_xlog if not found in archive.
   */
  static int
! XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
  {
  	char		path[MAXPGPATH];
  	ListCell   *cell;
--- 2644,2650 ----
   * searched in pg_xlog if not found in archive.
   */
  static int
! XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
  {
  	char		path[MAXPGPATH];
  	ListCell   *cell;
***************
*** 2647,2666 ****
  		if (tli < curFileTLI)
  			break;				/* don't bother looking at too-old TLIs */

! 		fd = XLogFileRead(log, seg, emode, tli, fromArchive, true);
! 		if (fd != -1)
! 			return fd;

! 		/*
! 		 * If not in StandbyMode, fall back to searching pg_xlog. In
! 		 * StandbyMode we're streaming segments from the primary to pg_xlog,
! 		 * and we mustn't confuse the (possibly partial) segments in pg_xlog
! 		 * with complete segments ready to be applied. We rather wait for the
! 		 * records to arrive through streaming.
! 		 */
! 		if (!StandbyMode && fromArchive)
  		{
! 			fd = XLogFileRead(log, seg, emode, tli, false, true);
  			if (fd != -1)
  				return fd;
  		}
--- 2667,2685 ----
  		if (tli < curFileTLI)
  			break;				/* don't bother looking at too-old TLIs */

! 		if (sources & XLOG_FROM_ARCHIVE)
! 		{
! 			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
! 			if (fd != -1)
! 			{
! 				elog(DEBUG1, "got WAL segment from archive");
! 				return fd;
! 			}
! 		}

! 		if (sources & XLOG_FROM_PG_XLOG)
  		{
! 			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
  			if (fd != -1)
  				return fd;
  		}
***************
*** 3530,3545 ****
  	uint32		pageHeaderSize;
  	int			emode;

- 	/*
- 	 * We don't expect any invalid records during streaming recovery: we
- 	 * should never hit the end of WAL because we wait for it to be streamed.
- 	 * Therefore treat any broken WAL as PANIC, instead of failing over.
- 	 */
- 	if (StandbyMode)
- 		emode = PANIC;
- 	else
- 		emode = emode_arg;
-
  	if (readBuf == NULL)
  	{
  		/*
--- 3549,3554 ----
***************
*** 3591,3600 ****
  		randAccess = true;		/* allow curFileTLI to go backwards too */
  	}

  	/* Read the page containing the record */
! 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
  		return NULL;

  	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
  	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
  	if (targetRecOff == 0)
--- 3600,3623 ----
  		randAccess = true;		/* allow curFileTLI to go backwards too */
  	}

+ 	/* This is the first attempt to read this page. */
+ 	failedSources = 0;
+ retry:
  	/* Read the page containing the record */
! 	if (!XLogPageRead(RecPtr, emode_arg, fetching_ckpt, randAccess))
  		return NULL;

+ 	/*
+ 	 * We don't expect any invalid records in archive or in records streamed
+ 	 * from master: we should never hit the end of WAL because we wait for it
+ 	 * to be streamed. Therefore treat any broken WAL as PANIC, instead of
+ 	 * failing over.
+ 	 */
+ 	if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
+ 		emode = PANIC;
+ 	else
+ 		emode = emode_arg;
+
  	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
  	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
  	if (targetRecOff == 0)
***************
*** 3828,3833 ****
--- 3851,3864 ----
  next_record_is_invalid:;
  		close(readFile);
  		readFile = -1;
  	}
+
+ 	/* In standby mode, retry from another source */
+ 	if (StandbyMode)
+ 	{
+ 		failedSources |= readSource;
+ 		goto retry;
+ 	}
+
  	return NULL;
  }
***************
*** 8698,8704 ****
   * as for waiting for the requested WAL record to arrive in standby mode.
   */
  static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
  			 bool randAccess)
  {
  	static XLogRecPtr receivedUpto = {0, 0};
--- 8729,8735 ----
   * as for waiting for the requested WAL record to arrive in standby mode.
   */
  static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt,
  			 bool randAccess)
  {
  	static XLogRecPtr receivedUpto = {0, 0};
***************
*** 8707,8719 ****
  	uint32		targetRecOff;
  	uint32		targetId;
  	uint32		targetSeg;

  	XLByteToSeg(*RecPtr, targetId, targetSeg);
  	targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
  	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

  	/* Fast exit if we have read the record in the current buffer already */
! 	if (targetId == readId && targetSeg == readSeg &&
  		targetPageOff == readOff && targetRecOff < readLen)
  		return true;
--- 8738,8752 ----
  	uint32		targetRecOff;
  	uint32		targetId;
  	uint32		targetSeg;
+ 	int			emode;
+ 	static pg_time_t last_fail_time = 0;

  	XLByteToSeg(*RecPtr, targetId, targetSeg);
  	targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
  	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;

  	/* Fast exit if we have read the record in the current buffer already */
! 	if (failedSources == 0 && targetId == readId && targetSeg == readSeg &&
  		targetPageOff == readOff && targetRecOff < readLen)
  		return true;
***************
*** 8725,8742 ****
  	{
  		close(readFile);
  		readFile = -1;
  	}

  	XLByteToSeg(*RecPtr, readId, readSeg);

  	/* See if we need to retrieve more data */
  	if (readFile < 0 ||
! 		(readStreamed && !XLByteLT(*RecPtr, receivedUpto)))
  	{
  		if (StandbyMode)
  		{
- 			bool		last_restore_failed = false;
-
  			/*
  			 * In standby mode, wait for the requested record to become
  			 * available, either via restore_command succeeding to restore the
--- 8758,8775 ----
  	{
  		close(readFile);
  		readFile = -1;
+ 		readSource = 0;
  	}

  	XLByteToSeg(*RecPtr, readId, readSeg);

+ retry:
  	/* See if we need to retrieve more data */
  	if (readFile < 0 ||
! 		(readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
  	{
  		if (StandbyMode)
  		{
  			/*
  			 * In standby mode, wait for the requested record to become
  			 * available, either via restore_command succeeding to restore the
***************
*** 8746,8751 ****
--- 8779,8786 ----
  			{
  				if (WalRcvInProgress())
  				{
+ 					failedSources = 0;
+
  					/*
  					 * While walreceiver is active, wait for new WAL to arrive
  					 * from primary.
***************
*** 8761,8775 ****
  					{
  						readFile = XLogFileRead(readId, readSeg, PANIC,
! 												recoveryTargetTLI, false, false);
  						switched_segment = true;
! 						readStreamed = true;
  					}
  					break;
  				}

  				if (CheckForStandbyTrigger())
! 					goto next_record_is_invalid;

  				/*
  				 * When streaming is active, we want to react quickly when
--- 8796,8811 ----
  					{
  						readFile = XLogFileRead(readId, readSeg, PANIC,
! 												recoveryTargetTLI,
! 												XLOG_FROM_PG_XLOG, false);
  						switched_segment = true;
! 						readSource = XLOG_FROM_STREAM;
  					}
  					break;
  				}

  				if (CheckForStandbyTrigger())
! 					goto triggered;

  				/*
  				 * When streaming is active, we want to react quickly when
***************
*** 8779,8784 ****
--- 8815,8823 ----
  				}
  				else
  				{
+ 					int			sources;
+ 					pg_time_t	now;
+
  					/*
  					 * Until walreceiver manages to reconnect, poll the
  					 * archive.
***************
*** 8791,8828 ****
  					/* Reset curFileTLI if random fetch. */
  					if (randAccess)
  						curFileTLI = 0;
! 					readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2, true);
  					switched_segment = true;
- 					readStreamed = false;
  					if (readFile != -1)
  					{
- 						elog(DEBUG1, "got WAL segment from archive");
  						break;
  					}

  					/*
! 					 * If we succeeded restoring some segments from archive
! 					 * since the last connection attempt (or we haven't tried
! 					 * streaming yet, retry immediately. But if we haven't,
! 					 * assume the problem is persistent, so be less
! 					 * aggressive.
  					 */
! 					if (last_restore_failed)
  					{
! 						/*
! 						 * Check to see if the trigger file exists. Note that
! 						 * we do this only after failure, so when you create
! 						 * the trigger file, we still finish replaying as much
! 						 * as we can before failover.
! 						 */
! 						if (CheckForStandbyTrigger())
! 							goto next_record_is_invalid;
! 						pg_usleep(5000000L);	/* 5 seconds */
  					}
! 					last_restore_failed = true;

  					/*
! 					 * Nope, not found in archive. Try to stream it.
  					 *
  					 * If fetching_ckpt is TRUE, RecPtr points to the initial
  					 * checkpoint location. In that case, we use RedoStartLSN
--- 8830,8877 ----
  					/* Reset curFileTLI if random fetch. */
  					if (randAccess)
  						curFileTLI = 0;
!
! 					/*
! 					 * Try to restore the file from archive, or read an
! 					 * existing file from pg_xlog.
! 					 */
! 					sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
! 					sources &= ~failedSources;
! 					readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
! 												  sources);
  					switched_segment = true;
  					if (readFile != -1)
  					{
  						break;
  					}

  					/*
! 					 * Nope, not found in archive.
! 					 */
!
! 					/*
! 					 * Check to see if the trigger file exists. Note that
! 					 * we do this only after failure, so when you create
! 					 * the trigger file, we still finish replaying as much
! 					 * as we can from archive and pg_xlog before failover.
  					 */
! 					if (CheckForStandbyTrigger())
! 						goto triggered;
!
! 					/*
! 					 * Sleep if it hasn't been long since last attempt.
! 					 */
! 					now = (pg_time_t) time(NULL);
! 					if ((now - last_fail_time) < 5)
  					{
! 						pg_usleep(1000000L * (5 - (now - last_fail_time)));
! 						now = (pg_time_t) time(NULL);
  					}
! 					last_fail_time = now;

  					/*
! 					 * If primary_conninfo is set, launch walreceiver to
! 					 * try to stream the missing WAL.
  					 *
  					 * If fetching_ckpt is TRUE, RecPtr points to the initial
  					 * checkpoint location. In that case, we use RedoStartLSN
***************
*** 8847,8859 ****
  		/* In archive or crash recovery. */
  		if (readFile < 0)
  		{
  			/* Reset curFileTLI if random fetch. */
  			if (randAccess)
  				curFileTLI = 0;
! 			readFile = XLogFileReadAnyTLI(readId, readSeg, emode,
! 										  InArchiveRecovery);
  			switched_segment = true;
- 			readStreamed = false;
  			if (readFile < 0)
  				return false;
  		}
--- 8896,8914 ----
  		/* In archive or crash recovery. */
  		if (readFile < 0)
  		{
+ 			int			sources;
+
  			/* Reset curFileTLI if random fetch. */
  			if (randAccess)
  				curFileTLI = 0;
!
! 			sources = XLOG_FROM_PG_XLOG;
! 			if (InArchiveRecovery)
! 				sources |= XLOG_FROM_ARCHIVE;
! 			sources &= ~failedSources;
!
! 			readFile = XLogFileReadAnyTLI(readId, readSeg, emode_arg,
! 										  sources);
  			switched_segment = true;
  			if (readFile < 0)
  				return false;
  		}
***************
*** 8861,8878 ****
  	}

  	/*
! 	 * At this point, we have the right segment open and we know the requested
! 	 * record is in it.
  	 */
  	Assert(readFile != -1);

  	/*
  	 * If the current segment is being streamed from master, calculate how
  	 * much of the current page we have received already. We know the
  	 * requested record has been received, but this is for the benefit of
  	 * future calls, to allow quick exit at the top of this function.
  	 */
! 	if (readStreamed)
  	{
  		if (RecPtr->xlogid != receivedUpto.xlogid ||
  			(RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
--- 8916,8944 ----
  	}

  	/*
! 	 * At this point, we have the right segment open and if we're streaming
! 	 * we know the requested record is in it.
  	 */
  	Assert(readFile != -1);

  	/*
+ 	 * We don't expect any invalid records in archive or in records streamed
+ 	 * from master: we should never hit the end of WAL because we wait for it
+ 	 * to be streamed. Therefore treat any broken WAL as PANIC, instead of
+ 	 * failing over.
+ 	 */
+ 	if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
+ 		emode = PANIC;
+ 	else
+ 		emode = emode_arg;
+
+ 	/*
  	 * If the current segment is being streamed from master, calculate how
  	 * much of the current page we have received already. We know the
  	 * requested record has been received, but this is for the benefit of
  	 * future calls, to allow quick exit at the top of this function.
  	 */
! 	if (readSource == XLOG_FROM_STREAM)
  	{
  		if (RecPtr->xlogid != receivedUpto.xlogid ||
  			(RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
***************
*** 8936,8946 ****
  	return true;

  next_record_is_invalid:
  	if (readFile >= 0)
  		close(readFile);
  	readFile = -1;
- 	readStreamed = false;
  	readLen = 0;
  	return false;
  }
--- 9002,9026 ----
  	return true;

  next_record_is_invalid:
+ 	failedSources |= readSource;
+
+ 	if (readFile >= 0)
+ 		close(readFile);
+ 	readFile = -1;
+ 	readLen = 0;
+ 	readSource = 0;
+
+ 	if (StandbyMode)
+ 		goto retry;
+ 	else
+ 		return false;
+
+ triggered:
  	if (readFile >= 0)
  		close(readFile);
  	readFile = -1;
  	readLen = 0;
+ 	readSource = 0;
  	return false;
  }
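Stripped of the xlog.c details, the retry logic in the patch above reduces to a bitmask over failed sources. The `XLOG_FROM_*` values mirror the patch, but `next_source()` is an illustrative helper, not a function in the patch.

```c
/*
 * Condensed sketch of the patch's source-selection idea: try the
 * archive first, then pg_xlog, then streaming, and mask out any
 * source that has already failed for the current record (the
 * failedSources variable in the patch). next_source() is
 * illustrative only.
 */
#define XLOG_FROM_ARCHIVE (1 << 0)  /* restored via restore_command */
#define XLOG_FROM_PG_XLOG (1 << 1)  /* existing file in pg_xlog */
#define XLOG_FROM_STREAM  (1 << 2)  /* streamed from the primary */

static int
next_source(int failed_sources)
{
    static const int order[] = {
        XLOG_FROM_ARCHIVE, XLOG_FROM_PG_XLOG, XLOG_FROM_STREAM
    };

    for (int i = 0; i < 3; i++)
        if (!(failed_sources & order[i]))
            return order[i];

    return 0;   /* every source has failed for this record */
}
```

Each time a source yields an invalid record, its bit is OR-ed into the failure mask and the loop moves on; in standby mode the mask is cleared once a record is successfully read.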
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon Riggs wrote:
>> We might also have written half a file many times. The files in pg_xlog
>> are suspect whereas the files in the archive are not. If we have both we
>> should prefer the archive.
>
> Yep.

Really? That will result in a change in the longstanding behavior of
ordinary recovery. I'm unconvinced that this is wise.

			regards, tom lane
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Simon Riggs wrote:
>>> We might also have written half a file many times. The files in pg_xlog
>>> are suspect whereas the files in the archive are not. If we have both we
>>> should prefer the archive.
>
>> Yep.
>
> Really? That will result in a change in the longstanding behavior of
> ordinary recovery.

Really? Not as far as I can see. Recovery has always tried to restore
WAL files from archive first, falling back to the copy in pg_xlog only
if restore_command returns failure.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:
> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown, because it's not
> expected to have errors in the archive or in the master.

Hmm, I think I've heard that tools like walmgr do incremental copies of
the current WAL segment to the archive. Doesn't this change break that?
(Maybe I'm confused and it doesn't work that way)

--
Alvaro Herrera
http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Heikki Linnakangas escribió:
>> When recovery reaches an invalid WAL record, typically caused by a
>> half-written WAL file, it closes the file and moves to the next source.
>> If an error is found in a file restored from archive or in a portion
>> just streamed from master, however, a PANIC is thrown, because it's not
>> expected to have errors in the archive or in the master.
>
> Hmm, I think I've heard that tools like walmgr do incremental copies of
> the current WAL segment to the archive. Doesn't this change break that?

Hmm, you could keep that working with a restore_command that checks the
file size before restoring. I note that pg_standby does that, but of
course you can't use pg_standby with the built-in standby mode.

Or maybe we should modify the built-in standby mode to handle partial
files coming from restore_command: instead of throwing an error, recover
to the end of the partial file, then retry restore_command with the same
filename until the whole file has been recovered (or the missing WAL is
received through other means, i.e. streaming replication).

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
Sorry for the delay.

On Fri, Mar 19, 2010 at 8:37 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's a patch I've been playing with.

Thanks! I'm reading the patch.

> The idea is that in standby mode,
> the server keeps trying to make progress in the recovery by:
>
> a) restoring files from archive
> b) replaying files from pg_xlog
> c) streaming from master
>
> When recovery reaches an invalid WAL record, typically caused by a
> half-written WAL file, it closes the file and moves to the next source.
> If an error is found in a file restored from archive or in a portion
> just streamed from master, however, a PANIC is thrown, because it's not
> expected to have errors in the archive or in the master.

But in the current (v8.4 or before) behavior, recovery ends normally
when an invalid record is found in an archived WAL file. Otherwise,
the server would never be able to start normal processing when there
is a corrupted archived file for some reasons. So, that invalid record
should not be treated as a PANIC if the server is not in standby mode
or the trigger file has been created. Thought?

When I tested the patch, the following PANIC error was thrown in the
normal archive recovery. This seems to derive from the above change.
The detail error sequence:

1. In ReadRecord(), emode was set to PANIC after 00000001000000000000000B
   was read.
2. 00000001000000000000000C including the contrecord tried to be read
   by using the emode (= PANIC). But since 00000001000000000000000C did
   not exist, PANIC error was thrown.

-----------------
LOG:  restored log file "00000001000000000000000B" from archive
cp: cannot stat `../data.arh/00000001000000000000000C': No such file or directory
PANIC:  could not open file "pg_xlog/00000001000000000000000C" (log file 0, segment 12): No such file or directory
LOG:  startup process (PID 17204) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
-----------------

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> But in the current (v8.4 or before) behavior, recovery ends normally
> when an invalid record is found in an archived WAL file. Otherwise,
> the server would never be able to start normal processing when there
> is a corrupted archived file for some reasons. So, that invalid record
> should not be treated as a PANIC if the server is not in standby mode
> or the trigger file has been created. Thought?

Hmm, true, this changes behavior over previous releases. I tend to think
that it's always an error if there's a corrupt file in the archive,
though, and PANIC is appropriate. If the administrator wants to start up
the database anyway, he can remove the corrupt file from the archive and
place it directly in pg_xlog instead.

> When I tested the patch, the following PANIC error was thrown in the
> normal archive recovery. This seems to derive from the above change.
> The detail error sequence:
> 1. In ReadRecord(), emode was set to PANIC after 00000001000000000000000B
>    was read.
> 2. 00000001000000000000000C including the contrecord tried to be read
>    by using the emode (= PANIC). But since 00000001000000000000000C did
>    not exist, PANIC error was thrown.
>
> -----------------
> LOG:  restored log file "00000001000000000000000B" from archive
> cp: cannot stat `../data.arh/00000001000000000000000C': No such file
> or directory
> PANIC:  could not open file "pg_xlog/00000001000000000000000C" (log
> file 0, segment 12): No such file or directory
> LOG:  startup process (PID 17204) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> -----------------

Thanks. That's easily fixable (applies over the previous patch):

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3773,7 +3773,7 @@ retry:
 		pagelsn.xrecoff = 0;
 	}
 	/* Wait for the next page to become available */
-	if (!XLogPageRead(&pagelsn, emode, false, false))
+	if (!XLogPageRead(&pagelsn, emode_arg, false, false))
 		return NULL;

 	/* Check that the continuation record looks valid */

Perhaps the emode/emode_arg convention is a bit hard to read. I'll go
through the patch myself once more, and commit later today or tomorrow
if no new issues crop up.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
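The emode rule being fixed here reduces to a tiny decision function. As a sketch only: the numeric level values and the function name below are placeholders, not PostgreSQL's real elog levels or code.

```c
/* Placeholder severity levels for illustration only. */
#define LOG_LEVEL   15
#define PANIC_LEVEL 22

#define XLOG_FROM_ARCHIVE (1 << 0)
#define XLOG_FROM_PG_XLOG (1 << 1)
#define XLOG_FROM_STREAM  (1 << 2)

/*
 * A bad record in WAL restored from the archive or streamed from the
 * master is never expected, so escalate to PANIC. A bad record in a
 * local pg_xlog file just means end-of-WAL, so keep the caller's
 * level (emode_arg) and let recovery try another source.
 */
static int
effective_emode(int emode_arg, int read_source)
{
    if (read_source == XLOG_FROM_STREAM || read_source == XLOG_FROM_ARCHIVE)
        return PANIC_LEVEL;
    return emode_arg;
}
```

The bug above was simply that the caller's level (`emode_arg`) had to be passed down before this escalation was applied, not the already-escalated value.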
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Wed, Mar 24, 2010 at 9:31 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Hmm, true, this changes behavior over previous releases. I tend to think
> that it's always an error if there's a corrupt file in the archive,
> though, and PANIC is appropriate. If the administrator wants to start up
> the database anyway, he can remove the corrupt file from the archive and
> place it directly in pg_xlog instead.

Okay.

> Thanks. That's easily fixable (applies over the previous patch):
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -3773,7 +3773,7 @@ retry:
>  			pagelsn.xrecoff = 0;
>  		}
>  		/* Wait for the next page to become available */
> -		if (!XLogPageRead(&pagelsn, emode, false, false))
> +		if (!XLogPageRead(&pagelsn, emode_arg, false, false))
>  			return NULL;
>
>  		/* Check that the continuation record looks valid */

Seems correct.

> 		sources &= ~failedSources;
> 		failedSources |= readSource;

The above lines in XLogPageRead() seem not to be required in normal recovery case (i.e., standby_mode = off). So how about the attached patch?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Wed, Mar 24, 2010 at 10:20 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Thanks. That's easily fixable (applies over the previous patch):
>>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -3773,7 +3773,7 @@ retry:
>>  			pagelsn.xrecoff = 0;
>>  		}
>>  		/* Wait for the next page to become available */
>> -		if (!XLogPageRead(&pagelsn, emode, false, false))
>> +		if (!XLogPageRead(&pagelsn, emode_arg, false, false))
>>  			return NULL;
>>
>>  		/* Check that the continuation record looks valid */
>
> Seems correct.

On second thought, the following lines seem to be necessary just after calling XLogPageRead() since it reads new WAL file from another source.

> if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
> 	emode = PANIC;
> else
> 	emode = emode_arg;

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
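The escalation rule in the quoted snippet can be shown as a tiny standalone function. `emode_for_source()` is a hypothetical name invented for this sketch (the later patch in this thread names the related helper `emode_for_corrupt_record`), and the numeric severity values are stand-ins rather than the real elog levels:

```c
#include <assert.h>

/* Stand-ins for the elog severity levels and source codes referenced in
 * the quoted snippet; the numeric values here are illustrative only. */
#define WARNING 19
#define PANIC   22

#define XLOG_FROM_ARCHIVE (1 << 0)
#define XLOG_FROM_PG_XLOG (1 << 1)
#define XLOG_FROM_STREAM  (1 << 2)

/* A record that came from the archive or from streaming replication
 * should never be broken, so escalate to PANIC; a record read from
 * pg_xlog may legitimately mark the end of WAL, so keep the caller's
 * level. This mirrors the "if (readSource == ...)" logic quoted above,
 * which must be re-evaluated after every XLogPageRead() call because
 * the page may have come from a different source than the last one. */
static int
emode_for_source(int readSource, int emode_arg)
{
    if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
        return PANIC;
    return emode_arg;
}

int
main(void)
{
    assert(emode_for_source(XLOG_FROM_ARCHIVE, WARNING) == PANIC);
    assert(emode_for_source(XLOG_FROM_PG_XLOG, WARNING) == WARNING);
    return 0;
}
```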
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Wed, 2010-03-24 at 14:31 +0200, Heikki Linnakangas wrote: > Fujii Masao wrote: > > But in the current (v8.4 or before) behavior, recovery ends normally > > when an invalid record is found in an archived WAL file. Otherwise, > > the server would never be able to start normal processing when there > > is a corrupted archived file for some reasons. So, that invalid record > > should not be treated as a PANIC if the server is not in standby mode > > or the trigger file has been created. Thought? > > Hmm, true, this changes behavior over previous releases. I tend to think > that it's always an error if there's a corrupt file in the archive, > though, and PANIC is appropriate. If the administrator wants to start up > the database anyway, he can remove the corrupt file from the archive and > place it directly in pg_xlog instead. I don't agree with changing the behaviour from previous releases. PANICing won't change the situation, so it just destroys server availability. If we had 1 master and 42 slaves then this behaviour would take down almost the whole server farm at once. Very uncool. You might have reason to prevent the server starting up at that point, when in standby mode, but that is not a reason to PANIC. We don't really want all of the standbys thinking they can be the master all at once either. Better to throw a serious ERROR and have the server still up and available for reads. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > PANICing won't change the situation, so it just destroys server > availability. If we had 1 master and 42 slaves then this behaviour would > take down almost the whole server farm at once. Very uncool. > > You might have reason to prevent the server starting up at that point, > when in standby mode, but that is not a reason to PANIC. We don't really > want all of the standbys thinking they can be the master all at once > either. Better to throw a serious ERROR and have the server still up and > available for reads. OK. How about making the startup process emit WARNING, stop WAL replay and wait for the presence of trigger file, when an invalid record is found? Which keeps the server up for readonly queries. And if the trigger file is found, I think that the startup process should emit a FATAL, i.e., the server should exit immediately, to prevent the server from becoming the primary in a half-finished state. Also to allow such a halfway failover, we should provide fast failover mode as pg_standby does? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes: > OK. How about making the startup process emit WARNING, stop WAL replay and > wait for the presence of trigger file, when an invalid record is found? > Which keeps the server up for readonly queries. And if the trigger file is > found, I think that the startup process should emit a FATAL, i.e., the > server should exit immediately, to prevent the server from becoming the > primary in a half-finished state. Also to allow such a halfway failover, > we should provide fast failover mode as pg_standby does? I find it extremely scary to read this sort of blue-sky design discussion going on now, two months after we were supposedly feature-frozen for 9.0. We need to be looking for the *rock bottom minimum* amount of work to do to get 9.0 out the door in a usable state; not what would be nice to have later on. regards, tom lane
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote: > On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > PANICing won't change the situation, so it just destroys server > > availability. If we had 1 master and 42 slaves then this behaviour would > > take down almost the whole server farm at once. Very uncool. > > > > You might have reason to prevent the server starting up at that point, > > when in standby mode, but that is not a reason to PANIC. We don't really > > want all of the standbys thinking they can be the master all at once > > either. Better to throw a serious ERROR and have the server still up and > > available for reads. > > OK. How about making the startup process emit WARNING, stop WAL replay and > wait for the presence of trigger file, when an invalid record is found? > Which keeps the server up for readonly queries. And if the trigger file is > found, I think that the startup process should emit a FATAL, i.e., the > server should exit immediately, to prevent the server from becoming the > primary in a half-finished state. Also to allow such a halfway failover, > we should provide fast failover mode as pg_standby does? The lack of docs begins to show a lack of coherent high-level design here. By now, I've forgotten what this thread was even about. The major design decision in this that keeps showing up is "remove pg_standby, at all costs" but no reason has ever been given for that. I do believe there is a "better way", but we won't find it by trial and error, even if we had time to do so. Please work on some clear docs for the failure modes in this system. That way we can all read them and understand them, or point out further issues. Moving straight to code is not a solution to this, since what we need now is to all agree on the way forwards. If we ignore this, then there is considerable risk that streaming rep will have a fatal operational flaw. 
Please just document/diagram how it works now, highlighting the problems that still remain to be solved. We're all behind you and I'm helping wherever I can. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> OK. How about making the startup process emit WARNING, stop WAL replay and >> wait for the presence of trigger file, when an invalid record is found? >> Which keeps the server up for readonly queries. And if the trigger file is >> found, I think that the startup process should emit a FATAL, i.e., the >> server should exit immediately, to prevent the server from becoming the >> primary in a half-finished state. Also to allow such a halfway failover, >> we should provide fast failover mode as pg_standby does? > > I find it extremely scary to read this sort of blue-sky design > discussion going on now, two months after we were supposedly > feature-frozen for 9.0. We need to be looking for the *rock bottom > minimum* amount of work to do to get 9.0 out the door in a usable > state; not what would be nice to have later on. Agreed, this is getting complicated. I'm already worried about the amount of changes needed to make it work, I don't want to add any new modes. PANIC seems like the appropriate solution for now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote: > On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > PANICing won't change the situation, so it just destroys server > > availability. If we had 1 master and 42 slaves then this behaviour would > > take down almost the whole server farm at once. Very uncool. > > > > You might have reason to prevent the server starting up at that point, > > when in standby mode, but that is not a reason to PANIC. We don't really > > want all of the standbys thinking they can be the master all at once > > either. Better to throw a serious ERROR and have the server still up and > > available for reads. > > OK. How about making the startup process emit WARNING, stop WAL replay and > wait for the presence of trigger file, when an invalid record is found? > Which keeps the server up for readonly queries. Yes. Receiving new WAL records is a completely separate activity from running the rest of the server (in this release...). > And if the trigger file is > found, I think that the startup process should emit a FATAL, i.e., the > server should exit immediately, to prevent the server from becoming the > primary in a half-finished state. Please remember that "half-finished" is your judgment on what has happened in the particular scenario you are considering. In many cases, an invalid WAL record clearly and simply indicates the end of WAL and we should start up normally. "State" is a good word here. I'd like to see the server have a clear state model with well documented transitions between them. The state should also be externally queriable, so we can work out what its doing and how long we can expect it to keep doing it for. I don't want to be in a position where we are waiting for the server to sort itself out from a complex set of retries. > Also to allow such a halfway failover, > we should provide fast failover mode as pg_standby does? Yes, we definitely need a JFDI solution for immediate failover. 
-- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote: > PANIC seems like the appropriate solution for now. It definitely is not. Think some more. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote: >> And if the trigger file is >> found, I think that the startup process should emit a FATAL, i.e., the >> server should exit immediately, to prevent the server from becoming the >> primary in a half-finished state. > > Please remember that "half-finished" is your judgment on what has > happened in the particular scenario you are considering. In many cases, > an invalid WAL record clearly and simply indicates the end of WAL and we > should start up normally. Not in the archive. An invalid WAL record in a file in pg_xlog is usually an indication of end-of-WAL, but there should be no invalid records in the archived WAL files, or streamed from the master. > "State" is a good word here. I'd like to see the server have a clear > state model with well documented transitions between them. The state > should also be externally queriable, so we can work out what its doing > and how long we can expect it to keep doing it for. Agreed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
(cc'ing docs list)

Simon Riggs wrote:
> The lack of docs begins to show a lack of coherent high-level design
> here.

Yeah, I think you're right. It's becoming hard to keep track of how it's supposed to behave.

> By now, I've forgotten what this thread was even about. The major
> design decision in this that keeps showing up is "remove pg_standby, at
> all costs" but no reason has ever been given for that. I do believe
> there is a "better way", but we won't find it by trial and error, even
> if we had time to do so.

This has nothing to do with pg_standby.

> Please work on some clear docs for the failure modes in this system.
> That way we can all read them and understand them, or point out further
> issues. Moving straight to code is not a solution to this, since what we
> need now is to all agree on the way forwards. If we ignore this, then
> there is considerable risk that streaming rep will have a fatal
> operational flaw.
>
> Please just document/diagram how it works now, highlighting the problems
> that still remain to be solved. We're all behind you and I'm helping
> wherever I can.

Ok, here's my attempt at the docs. Read it as a replacement for the "High Availability, Load Balancing, and Replication" chapter, but of course many of the sections will be unchanged, as indicated below.

-------------
Chapter 25. High Availability, Load Balancing, and Replication

25.1 Comparison of different solutions

<no changes>

25.2 Log-Shipping Standby servers

<overview from current "File-based Log Shipping" section, with small changes so that it applies to the built-in standby mode as well as pg_standby-like solutions>

A standby server can also be used for read-only queries. This is called Hot Standby mode, see chapter XXX.

25.2.1 Planning

Set up two servers with identical hardware ...
<two first paragraphs of current File-based Log Shipping / Planning section>

25.2.3 Standby mode

In standby mode, the server continuously applies WAL received from the master server. The standby server can receive WAL from a WAL archive (see restore_command) or directly from the master over a TCP connection (streaming replication). The standby server will also attempt to restore any WAL found in the standby's pg_xlog. That typically happens after a server restart, to replay again any WAL streamed from the master before the restart, but you can also manually copy files to pg_xlog at any time to have them replayed.

At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory (possibly stored there by streaming replication before restart). If that fails, and streaming replication has been configured, the standby tries to connect to the master server and stream WAL from it. If that fails, or streaming replication is not configured, or if the connection is disconnected later on, the standby goes back to step 1 and tries restoring the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

A corrupt or half-finished WAL file in the archive, or streamed from the master, causes a PANIC and immediate shutdown of the standby server. A corrupt WAL file is always a serious event which requires administrator action. If you want to recover from a WAL file known to be corrupt as far as it can be replayed, you can copy the file manually into pg_xlog.

Standby mode is exited and the server switches to normal operation when a trigger file is found (trigger_file). Before failover, it will restore any WAL available in the archive or in pg_xlog, but won't try to connect to the master or wait for files to become available in the archive.

25.2.4 Preparing Master for Standby servers

Set up continuous archiving to a WAL archive on the master, as described in the chapter "Continuous Archiving and Point-In-Time Recovery". The archive location should be accessible from the standby even when the master is down, i.e. it should reside on the standby server itself or another trusted server, not on the master server.

If you want to use streaming replication, set up authentication to allow streaming replication connections. Set max_wal_senders.

Take a base backup as described in chapter Continuous Archiving and Point-In-Time Recovery / Making a Base Backup.

25.2.4.1 Authentication for streaming replication

Ensure that listen_addresses allows connections from the standby server.
<current Streaming Replication / Authentication section, describing pg_hba.conf>

25.2.5 Setting up the standby server

1. Take a base backup, and copy it to the standby
2. Create a restore_command to restore files from the WAL archive.
3. Set standby_mode=on
4. If you want to use streaming replication, set primary_conninfo

You can use restartpoint_command to prune the archive of files no longer needed by the standby.

You can have any number of standby servers, as long as you set max_wal_senders high enough in the master to allow them to be connected simultaneously.

25.2.6 Archive recovery based log shipping

An alternative to the built-in standby mode described in the previous sections is to use a restore_command that polls the archive location. This was the only option available in versions 8.4 and below. In this setup, set standby_mode=off, because you are implementing the polling required for a standby server yourself. See contrib/pg_standby for a reference implementation of this.

Note that in this mode, the server will apply WAL one file at a time, so if you use the standby server for queries (see Hot Standby), there is a bigger delay between an action in the master and when the action becomes visible in the standby, corresponding to the time it takes to fill up the WAL file. archive_timeout can be used to make that delay shorter. Also note that you can't combine streaming replication with this method.

<all the sections from current File-based Log Shipping that don't apply to built-in standby mode go here>

25.3 Hot Standby

<no changes>

25.4 Incrementally Updated backups

<no changes>
-------------

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Simon Riggs wrote: > On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote: > >> PANIC seems like the appropriate solution for now. > > It definitely is not. Think some more. Well, what happens now in previous versions with pg_standby et al is that the standby starts up. That doesn't seem appropriate either. Hmm, it would be trivial to just stay in the standby mode at a corrupt file, continuously retrying to restore it and continue replay. If it's genuinely corrupt, it will never succeed and the standby gets stuck at that point. Maybe that's better; it's close to what Fujii suggested except that you don't need a new mode for it. I'm worried that the administrator won't notice the error promptly because at a quick glance the server is up and running, while it's actually stuck at the error and falling indefinitely behind the master. Maybe if we make it a WARNING, that's enough to alleviate that. It's true that if the standby is actively being used for read-only queries, shutting it down to just get the administrators attention isn't good either. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote: > Simon Riggs wrote: >> On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote: >> >>> PANIC seems like the appropriate solution for now. >> It definitely is not. Think some more. > > Well, what happens now in previous versions with pg_standby et al is > that the standby starts up. That doesn't seem appropriate either. > > Hmm, it would be trivial to just stay in the standby mode at a corrupt > file, continuously retrying to restore it and continue replay. If it's > genuinely corrupt, it will never succeed and the standby gets stuck at > that point. Maybe that's better; it's close to what Fujii suggested > except that you don't need a new mode for it. On second thought, that's easy for the built-in standby mode, but not for archive recovery where we're not currently retrying anything. In archive recovery, we could throw a WARNING and start up, which would be like the current behavior in older versions except you get a WARNING instead of LOG, or we could PANIC. I'm leaning towards PANIC, which makes most sense for genuine point-in-time or archive recovery (ie. not a standby server), but I can see the rationale for WARNING too, for pg_standby and for the sake of preserving old behavior. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
>> 		sources &= ~failedSources;
>> 		failedSources |= readSource;
>
> The above lines in XLogPageRead() seem not to be required in normal
> recovery case (i.e., standby_mode = off). So how about the attached
> patch?
>
> *** 9050,9056 **** next_record_is_invalid:
> --- 9047,9056 ----
>   	readSource = 0;
>
>   	if (StandbyMode)
> + 	{
> + 		failedSources |= readSource;
>   		goto retry;
> + 	}
>   	else
>   		return false;

That doesn't work because readSource is cleared above. But yeah, failedSources is not needed in archive recovery, so that line can be removed.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote:
> On second thought, the following lines seem to be necessary just after
> calling XLogPageRead() since it reads new WAL file from another source.
>
>> if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
>> 	emode = PANIC;
>> else
>> 	emode = emode_arg;

Yep. Here's an updated patch, with these changes since the last patch:

* Fix the bug of a spurious PANIC in archive recovery, if the WAL ends in the middle of a WAL record that continues over a WAL segment boundary.

* If a corrupt WAL record is found in archive or streamed from master in standby mode, throw WARNING instead of PANIC, and keep trying. In archive recovery (ie. standby_mode=off) it's still a PANIC. We can make it a WARNING too, which gives the pre-9.0 behavior of starting up the server on corruption. I prefer PANIC but the discussion is still going on.

* Small code changes to handling of failedSources, inspired by your comment. No change in functionality.

This is also available in my git repository at git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e57f22e..4aa1870 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -450,20 +450,33 @@ static uint32 openLogSeg = 0;
 static uint32 openLogOff = 0;
 
 /*
+ * Codes indicating where we got a WAL file from during recovery, or where
+ * to attempt to get one.
+ */
+#define XLOG_FROM_ARCHIVE	(1<<0)	/* Restored using restore_command */
+#define XLOG_FROM_PG_XLOG	(1<<1)	/* Existing file in pg_xlog */
+#define XLOG_FROM_STREAM	(1<<2)	/* Streamed from master */
+
+/*
  * These variables are used similarly to the ones above, but for reading
  * the XLOG.  Note, however, that readOff generally represents the offset
  * of the page just read, not the seek position of the FD itself, which
- * will be just past that page.  readLen indicates how much of the current
- * page has been read into readBuf.
+ * will be just past that page.  readLen indicates how much of the current
+ * page has been read into readBuf, and readSource indicates where we got
+ * the currently open file from.
  */
 static int	readFile = -1;
 static uint32 readId = 0;
 static uint32 readSeg = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
+static int	readSource = 0;		/* XLOG_FROM_* code */
 
-/* Is the currently open segment being streamed from primary? */
-static bool readStreamed = false;
+/*
+ * Keeps track of which sources we've tried to read the current WAL
+ * record from and failed.
+ */
+static int failedSources = 0;
 
 /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
 static char *readBuf = NULL;
@@ -517,11 +530,12 @@ static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
 					   bool find_free, int *max_advance,
 					   bool use_lock);
 static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
-			 bool fromArchive, bool notexistOk);
+			 int source, bool notexistOk);
 static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
-				   bool fromArchive);
+				   int sources);
 static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess);
+static int emode_for_corrupt_record(int endofwalmode);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
 					const char *recovername, off_t expectedSize);
@@ -2573,7 +2587,7 @@ XLogFileOpen(uint32 log, uint32 seg)
  */
 static int
 XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
-			 bool fromArchive, bool notfoundOk)
+			 int source, bool notfoundOk)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFileName(xlogfname, tli, log, seg);
 
-	if (fromArchive)
+	switch (source)
 	{
-		/* Report recovery progress in PS display */
-		snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
-				 xlogfname);
-		set_ps_display(activitymsg, false);
+		case XLOG_FROM_ARCHIVE:
+			/* Report recovery progress in PS display */
+			snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
+					 xlogfname);
+			set_ps_display(activitymsg, false);
 
-		restoredFromArchive = RestoreArchivedFile(path, xlogfname,
-												  "RECOVERYXLOG",
-												  XLogSegSize);
-		if (!restoredFromArchive)
-			return -1;
-	}
-	else
-	{
-		XLogFilePath(path, tli, log, seg);
-		restoredFromArchive = false;
+			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
+													  "RECOVERYXLOG",
+													  XLogSegSize);
+			if (!restoredFromArchive)
+				return -1;
+			break;
+
+		case XLOG_FROM_PG_XLOG:
+			XLogFilePath(path, tli, log, seg);
+			restoredFromArchive = false;
+			break;
+
+		default:
+			elog(ERROR, "invalid XLogFileRead source %d", source);
 	}
 
 	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
@@ -2612,6 +2631,8 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 				 xlogfname);
 		set_ps_display(activitymsg, false);
 
+		readSource = source;
+
 		return fd;
 	}
 	if (errno != ENOENT || !notfoundOk)		/* unexpected failure? */
@@ -2630,7 +2651,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  * searched in pg_xlog if not found in archive.
  */
 static int
-XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
+XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -2653,20 +2674,19 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, bool fromArchive)
 		if (tli < curFileTLI)
 			break;				/* don't bother looking at too-old TLIs */
 
-		fd = XLogFileRead(log, seg, emode, tli, fromArchive, true);
-		if (fd != -1)
-			return fd;
+		if (sources & XLOG_FROM_ARCHIVE)
+		{
+			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
+			if (fd != -1)
+			{
+				elog(DEBUG1, "got WAL segment from archive");
+				return fd;
+			}
+		}
 
-		/*
-		 * If not in StandbyMode, fall back to searching pg_xlog. In
-		 * StandbyMode we're streaming segments from the primary to pg_xlog,
-		 * and we mustn't confuse the (possibly partial) segments in pg_xlog
-		 * with complete segments ready to be applied. We rather wait for the
-		 * records to arrive through streaming.
-		 */
-		if (!StandbyMode && fromArchive)
+		if (sources & XLOG_FROM_PG_XLOG)
 		{
-			fd = XLogFileRead(log, seg, emode, tli, false, true);
+			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
 			if (fd != -1)
 				return fd;
 		}
@@ -3520,7 +3540,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 * the returned record pointer always points there.
 */
 static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
+ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 {
 	XLogRecord *record;
 	char	   *buffer;
@@ -3530,17 +3550,6 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 				total_len;
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
-	int			emode;
-
-	/*
-	 * We don't expect any invalid records during streaming recovery: we
-	 * should never hit the end of WAL because we wait for it to be streamed.
-	 * Therefore treat any broken WAL as PANIC, instead of failing over.
-	 */
-	if (StandbyMode)
-		emode = PANIC;
-	else
-		emode = emode_arg;
 
 	if (readBuf == NULL)
 	{
@@ -3593,6 +3602,9 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 		randAccess = true;		/* allow curFileTLI to go backwards too */
 	}
 
+	/* This is the first try to read this page. */
+	failedSources = 0;
+retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
 		return NULL;
@@ -3611,7 +3623,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("invalid record offset at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3619,7 +3631,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("contrecord is requested by %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3634,7 +3646,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	{
 		if (record->xl_len != 0)
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode),
 					(errmsg("invalid xlog switch record at %X/%X",
 							RecPtr->xlogid, RecPtr->xrecoff)));
 			goto next_record_is_invalid;
@@ -3642,7 +3654,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	}
 	else if (record->xl_len == 0)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("record with zero length at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3651,14 +3663,14 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
 		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("invalid record length at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("invalid resource manager ID %u at %X/%X",
 						record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3671,7 +3683,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	 */
 	if (!XLByteLT(record->xl_prev, *RecPtr))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("record with incorrect prev-link %X/%X at %X/%X",
 						record->xl_prev.xlogid, record->xl_prev.xrecoff,
 						RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3687,7 +3699,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	 */
 	if (!XLByteEQ(record->xl_prev, ReadRecPtr))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("record with incorrect prev-link %X/%X at %X/%X",
 						record->xl_prev.xlogid, record->xl_prev.xrecoff,
 						RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3716,7 +3728,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	{
 		readRecordBufSize = 0;
 		/* We treat this as a "bogus data" condition */
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode),
 				(errmsg("record length %u at %X/%X too long",
 						total_len, RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3756,7 +3768,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 		/* Check that the continuation record looks valid */
 		if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode),
 					(errmsg("there is no contrecord flag in log file %u, segment %u, offset %u",
 							readId, readSeg, readOff)));
 			goto next_record_is_invalid;
@@ -3766,7 +3778,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 		if (contrecord->xl_rem_len == 0 ||
 			total_len != (contrecord->xl_rem_len + gotlen))
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode),
 					(errmsg("invalid contrecord length %u in log file %u, segment %u, offset %u",
 							contrecord->xl_rem_len,
 							readId, readSeg, readOff)));
@@ -3784,7 +3796,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 				contrecord->xl_rem_len);
 			break;
 		}
-		if (!RecordIsValid(record, *RecPtr, emode))
+		if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
 			goto next_record_is_invalid;
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 		EndRecPtr.xlogid = readId;
@@ -3798,7 +3810,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	}
 
 	/* Record does not cross a page boundary */
-	if (!RecordIsValid(record, *RecPtr, emode))
+	if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
 		goto next_record_is_invalid;
 	EndRecPtr.xlogid = RecPtr->xlogid;
 	EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
@@ -3824,13 +3836,20 @@ ReadRecord(XLogRecPtr *RecPtr, int emode_arg, bool fetching_ckpt)
 	}
 
 	return (XLogRecord *) buffer;
 
-next_record_is_invalid:;
+next_record_is_invalid:
+	failedSources |= readSource;
+
 	if (readFile >= 0)
 	{
 		close(readFile);
 		readFile = -1;
 	}
-	return NULL;
+
+	/* In standby-mode, keep trying */
+	if (StandbyMode)
+		goto retry;
+	else
+		return NULL;
 }
 
 /*
@@ -8731,10 +8750,15 @@ StartupProcessMain(void)
 
 /*
 * Read the XLOG page containing RecPtr into readBuf (if not read already).
- * Returns true if successful, false otherwise or fails if emode is PANIC.
+ * Returns true if the page is read successfully.
 *
 * This is responsible for restoring files from archive as needed, as well
 * as for waiting for the requested WAL record to arrive in standby mode.
+ *
+ * 'emode' specifies the log level used for reporting "file not found" or
+ * "end of WAL" situations in archive recovery, or in standby mode if a trigger
+ * file is found. If set to WARNING or below, XLogPageRead() returns false
+ * in those situations, otherwise the ereport() will cause an error exit.
*/ static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, @@ -8746,13 +8770,14 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, uint32 targetRecOff; uint32 targetId; uint32 targetSeg; + static pg_time_t last_fail_time = 0; XLByteToSeg(*RecPtr, targetId, targetSeg); targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ; targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ; /* Fast exit if we have read the record in the current buffer already */ - if (targetId == readId && targetSeg == readSeg && + if (failedSources == 0 && targetId == readId && targetSeg == readSeg && targetPageOff == readOff && targetRecOff < readLen) return true; @@ -8764,18 +8789,18 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, { close(readFile); readFile = -1; + readSource = 0; } XLByteToSeg(*RecPtr, readId, readSeg); +retry: /* See if we need to retrieve more data */ if (readFile < 0 || - (readStreamed && !XLByteLT(*RecPtr, receivedUpto))) + (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto))) { if (StandbyMode) { - bool last_restore_failed = false; - /* * In standby mode, wait for the requested record to become * available, either via restore_command succeeding to restore the @@ -8800,15 +8825,16 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, { readFile = XLogFileRead(readId, readSeg, PANIC, - recoveryTargetTLI, false, false); + recoveryTargetTLI, + XLOG_FROM_PG_XLOG, false); switched_segment = true; - readStreamed = true; + readSource = XLOG_FROM_STREAM; } break; } if (CheckForStandbyTrigger()) - goto next_record_is_invalid; + goto triggered; /* * When streaming is active, we want to react quickly when @@ -8818,6 +8844,9 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, } else { + int sources; + pg_time_t now; + /* * Until walreceiver manages to reconnect, poll the * archive. 
@@ -8830,48 +8859,73 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, /* Reset curFileTLI if random fetch. */ if (randAccess) curFileTLI = 0; - readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2, true); - switched_segment = true; - readStreamed = false; - if (readFile != -1) - { - elog(DEBUG1, "got WAL segment from archive"); - break; - } /* - * If we succeeded restoring some segments from archive - * since the last connection attempt (or we haven't tried - * streaming yet, retry immediately. But if we haven't, - * assume the problem is persistent, so be less - * aggressive. + * Try to restore the file from archive, or read an + * existing file from pg_xlog. */ - if (last_restore_failed) + sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG; + if (!(sources & ~failedSources)) { /* - * Check to see if the trigger file exists. Note that - * we do this only after failure, so when you create - * the trigger file, we still finish replaying as much - * as we can before failover. + * We've exhausted all options for retrieving the + * file. Retry ... */ - if (CheckForStandbyTrigger()) - goto next_record_is_invalid; - pg_usleep(5000000L); /* 5 seconds */ + failedSources = 0; + + /* + * ... but sleep first if it hasn't been long since + * last attempt. + */ + now = (pg_time_t) time(NULL); + if ((now - last_fail_time) < 5) + { + pg_usleep(1000000L * (5 - (now - last_fail_time))); + now = (pg_time_t) time(NULL); + } + last_fail_time = now; + + /* + * If primary_conninfo is set, launch walreceiver to + * try to stream the missing WAL, before retrying + * to restore from archive/pg_xlog. + * + * If fetching_ckpt is TRUE, RecPtr points to the + * initial checkpoint location. In that case, we use + * RedoStartLSN as the streaming start position instead + * of RecPtr, so that when we later jump backwards to + * start redo at RedoStartLSN, we will have the logs + * streamed already. + */ + if (PrimaryConnInfo) + { + RequestXLogStreaming( + fetching_ckpt ? 
RedoStartLSN : *RecPtr, + PrimaryConnInfo); + continue; + } } - last_restore_failed = true; + /* Don't try to read from a source that just failed */ + sources &= ~failedSources; + readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2, + sources); + switched_segment = true; + if (readFile != -1) + break; /* - * Nope, not found in archive. Try to stream it. - * - * If fetching_ckpt is TRUE, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. + * Nope, not found in archive and/or pg_xlog. */ - if (PrimaryConnInfo) - RequestXLogStreaming(fetching_ckpt ? RedoStartLSN : *RecPtr, - PrimaryConnInfo); + failedSources |= sources; + + /* + * Check to see if the trigger file exists. Note that + * we do this only after failure, so when you create + * the trigger file, we still finish replaying as much + * as we can from archive and pg_xlog before failover. + */ + if (CheckForStandbyTrigger()) + goto triggered; } /* @@ -8886,13 +8940,18 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, /* In archive or crash recovery. */ if (readFile < 0) { + int sources; /* Reset curFileTLI if random fetch. */ if (randAccess) curFileTLI = 0; + + sources = XLOG_FROM_PG_XLOG; + if (InArchiveRecovery) + sources |= XLOG_FROM_ARCHIVE; + readFile = XLogFileReadAnyTLI(readId, readSeg, emode, - InArchiveRecovery); + sources); switched_segment = true; - readStreamed = false; if (readFile < 0) return false; } @@ -8900,8 +8959,8 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, } /* - * At this point, we have the right segment open and we know the requested - * record is in it. + * At this point, we have the right segment open and if we're streaming + * we know the requested record is in it. 
*/ Assert(readFile != -1); @@ -8911,7 +8970,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, * requested record has been received, but this is for the benefit of * future calls, to allow quick exit at the top of this function. */ - if (readStreamed) + if (readSource == XLOG_FROM_STREAM) { if (RecPtr->xlogid != receivedUpto.xlogid || (RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ)) @@ -8936,13 +8995,14 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, readOff = 0; if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ) { - ereport(emode, + ereport(emode_for_corrupt_record(emode), (errcode_for_file_access(), errmsg("could not read from log file %u, segment %u, offset %u: %m", readId, readSeg, readOff))); goto next_record_is_invalid; } - if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode)) + if (!ValidXLOGHeader((XLogPageHeader) readBuf, + emode_for_corrupt_record(emode))) goto next_record_is_invalid; } @@ -8950,7 +9010,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, readOff = targetPageOff; if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0) { - ereport(emode, + ereport(emode_for_corrupt_record(emode), (errcode_for_file_access(), errmsg("could not seek in log file %u, segment %u to offset %u: %m", readId, readSeg, readOff))); @@ -8958,13 +9018,13 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, } if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ) { - ereport(emode, + ereport(emode_for_corrupt_record(emode), (errcode_for_file_access(), errmsg("could not read from log file %u, segment %u, offset %u: %m", readId, readSeg, readOff))); goto next_record_is_invalid; } - if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode)) + if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode_for_corrupt_record(emode))) goto next_record_is_invalid; Assert(targetId == readId); @@ -8975,16 +9035,67 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt, return true; 
next_record_is_invalid: + failedSources |= readSource; + if (readFile >= 0) close(readFile); readFile = -1; - readStreamed = false; readLen = 0; + readSource = 0; + + /* In standby-mode, keep trying */ + if (StandbyMode) + goto retry; + else + return false; + +triggered: + if (readFile >= 0) + close(readFile); + readFile = -1; + readLen = 0; + readSource = 0; return false; } /* + * Determine what log level should be used to report a corrupt WAL record + * in the current WAL page, previously read by XLogPageRead(). + * + * 'emode' is the error mode that would be used to report a file-not-found + * or legitimate end-of-WAL situation. It is upgraded to WARNING or PANIC + * if the an corrupt record is not expected at this point. + */ +static int +emode_for_corrupt_record(int emode) +{ + /* + * We don't expect any invalid records in archive or in records streamed + * from master. Files in the archive should be complete, and we should + * never hit the end of WAL because we stop and wait for more WAL to + * arrive before replaying it. + * + * In standby mode, throw a WARNING and keep retrying. If we're lucky + * it's a transient error and will go away by itself, and in any case + * it's better to keep the standby open for any possible read-only + * queries. In PITR, however, stop recovery immediately, rather than + * fail over. + */ + if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE) + { + if (StandbyMode) + { + if (emode < WARNING) + emode = WARNING; + } + else + emode = PANIC; + } + return emode; +} + +/* * Check to see if the trigger file exists. If it does, request postmaster * to shut down walreceiver, wait for it to exit, remove the trigger * file, and return true.
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Robert Haas
Date:
On Thu, Mar 25, 2010 at 8:55 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > * If a corrupt WAL record is found in archive or streamed from master in > standby mode, throw WARNING instead of PANIC, and keep trying. In > archive recovery (ie. standby_mode=off) it's still a PANIC. We can make > it a WARNING too, which gives the pre-9.0 behavior of starting up the > server on corruption. I prefer PANIC but the discussion is still going on. I don't think we should be changing pre-9.0 behavior more than necessary. ...Robert
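The escalation rule being debated is small; a Python rendering of the patch's emode_for_corrupt_record() logic (the numeric log levels here are assumptions, only their ordering matters) shows the WARNING-vs-PANIC split:

```python
# Assumed relative ordering of elog levels: DEBUG2 < LOG < WARNING < PANIC.
DEBUG2, LOG, WARNING, PANIC = 10, 15, 19, 22

FROM_ARCHIVE, FROM_PG_XLOG, FROM_STREAM = 0x1, 0x2, 0x4


def emode_for_corrupt_record(emode, read_source, standby_mode):
    """Upgrade the log level for a corrupt WAL record, as in the patch:
    records restored from archive or streamed from the master are never
    expected to be corrupt, so escalate -- to WARNING (keep retrying) in
    standby mode, to PANIC (stop recovery) in plain archive recovery."""
    if read_source in (FROM_STREAM, FROM_ARCHIVE):
        if standby_mode:
            emode = max(emode, WARNING)
        else:
            emode = PANIC
    return emode
```

Note that a record read from pg_xlog keeps the caller's emode unchanged, which is how the pre-9.0 end-of-WAL handling during crash recovery is preserved.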
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-25 at 12:15 +0200, Heikki Linnakangas wrote: > (cc'ing docs list) > > Simon Riggs wrote: > > The lack of docs begins to show a lack of coherent high-level design > > here. > > Yeah, I think you're right. It's becoming hard to keep track of how it's > supposed to behave. Thank you for responding to that comment positively; I am relieved. -- Simon Riggs www.2ndQuadrant.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Simon Riggs
Date:
On Thu, 2010-03-25 at 12:26 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote: > > > >> PANIC seems like the appropriate solution for now. > > > > It definitely is not. Think some more. > > Well, what happens now in previous versions with pg_standby et al is > that the standby starts up. That doesn't seem appropriate either. Agreed. I said that also, immediately upthread. Bottom line is I am against anyone being allowed to PANIC the server just because their piece of it isn't working. The whole purpose of all of this is High Availability, and we don't get that if everybody keeps stopping for a tea break every time things get tricky. Staying up when problems occur is the only way to avoid a falling domino taking out the whole farm. > I'm worried that the administrator won't notice the error promptly > because at a quick glance the server is up and running, while it's > actually stuck at the error and falling indefinitely behind the master. > Maybe if we make it a WARNING, that's enough to alleviate that. It's > true that if the standby is actively being used for read-only queries, > shutting it down just to get the administrator's attention isn't good either. That's what monitoring is for. Let's just make sure this state is accessible, so people will notice. -- Simon Riggs www.2ndQuadrant.com
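One way to make this state visible to monitoring is to compare the %X/%X WAL positions the server exposes (for example via pg_last_xlog_receive_location() on 9.0) against the primary's current position. A sketch of the arithmetic, simplified to treat the xlogid/xrecoff pair as a single 64-bit offset (the exact per-xlogid capacity differs slightly in pre-9.3 servers):

```python
def lsn_to_int(lsn):
    """Convert a WAL location of the form 'XLOGID/XRECOFF' (the %X/%X
    format seen in the log messages above) to one byte position.
    Simplification: treats the pair as hi << 32 | lo, which is close
    enough for estimating lag between two positions."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replication_lag_bytes(primary_lsn, standby_lsn):
    """Approximate how far the standby trails the primary, in bytes."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)
```

A monitor that sees this number grow while the standby logs corrupt-record WARNINGs knows the standby is stuck, even though the server is still accepting read-only queries.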
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Thu, Mar 25, 2010 at 9:55 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > * Fix the bug of a spurious PANIC in archive recovery, if the WAL ends > in the middle of a WAL record that continues over a WAL segment boundary. > > * If a corrupt WAL record is found in archive or streamed from master in > standby mode, throw WARNING instead of PANIC, and keep trying. In > archive recovery (ie. standby_mode=off) it's still a PANIC. We can make > it a WARNING too, which gives the pre-9.0 behavior of starting up the > server on corruption. I prefer PANIC but the discussion is still going on. Seems reasonable to me. > * Small code changes to handling of failedSources, inspired by your > comment. No change in functionality. > > This is also available in my git repository at > git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges" I have looked at the patch and have not found any big problems so far. The attached small patch fixes the typo. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Heikki Linnakangas
Date:
Fujii Masao wrote: >> * Small code changes to handling of failedSources, inspired by your >> comment. No change in functionality. >> >> This is also available in my git repository at >> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges" > > I have looked at the patch and have not found any big problems so far. > The attached small patch fixes the typo. Thanks. Committed with that typo-fix, and I also added a comment explaining how failedSources and retrying XLogPageRead() works. I'm now happy with the standby mode logic. It was a bigger struggle than I anticipated back in January/February; I hope others now find it intuitive as well. I'm going to work on the documentation of this, along the lines of the draft I posted last week. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL
From
Fujii Masao
Date:
On Wed, Mar 31, 2010 at 1:28 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Fujii Masao wrote: >>> * Small code changes to handling of failedSources, inspired by your >>> comment. No change in functionality. >>> >>> This is also available in my git repository at >>> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges" >> >> I have looked at the patch and have not found any big problems so far. >> The attached small patch fixes the typo. > > Thanks. Committed with that typo-fix, and I also added a comment > explaining how failedSources and retrying XLogPageRead() works. Great! Thanks a lot! This change affects all of the recovery patterns that we support now: normal crash recovery, archive recovery, recovery using pg_standby, recovery in standby mode, and so on. So we need to test that all of those patterns still work with the change. I'll test as many of them as possible. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
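The recovery patterns listed above differ mainly in which WAL sources the startup process may consult. A rough Python summary of the committed source-selection logic (flag values assumed, function name hypothetical):

```python
FROM_ARCHIVE, FROM_PG_XLOG, FROM_STREAM = 0x1, 0x2, 0x4


def candidate_sources(in_archive_recovery, standby_mode, primary_conninfo):
    """Which sources XLogPageRead() may consult, roughly per the
    committed logic: crash recovery reads only pg_xlog; archive
    recovery adds the archive; standby mode polls archive and pg_xlog
    and, when primary_conninfo is set, also streams from the primary."""
    if standby_mode:
        sources = FROM_ARCHIVE | FROM_PG_XLOG
        if primary_conninfo:
            sources |= FROM_STREAM
        return sources
    sources = FROM_PG_XLOG
    if in_archive_recovery:
        sources |= FROM_ARCHIVE
    return sources
```

Covering the test matrix then means exercising each distinct combination: plain crash recovery, archive recovery (with or without pg_standby), standby mode without streaming, and streaming replication.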