Thread: Warm standby patch
I'm submitting this patch in an attempt to clarify some issues with the
warm standby documentation which caused some confusion in our
organization and which have been recently discussed on the admin list.

I apologize for posting twice, but I noticed that my pseudo-code
was unnecessarily long and complex, and the two return points might
offend some people.

-Kevin

Index: backup.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/backup.sgml,v
retrieving revision 2.98
diff -c -r2.98 backup.sgml
*** backup.sgml 29 Jun 2007 15:46:21 -0000 2.98
--- backup.sgml 29 Jun 2007 21:59:56 -0000
***************
*** 1402,1411 ****
      server. Normal recovery processing would request a file from the
      WAL archive, reporting failure if the file was unavailable.
      For standby processing it is normal for the next file to be
!     unavailable, so we must be patient and wait for it to appear. A
!     waiting <varname>restore_command</> can be written as a custom
      script that loops after polling for the existence of the next WAL
!     file. There must also be some way to trigger failover, which should
      interrupt the <varname>restore_command</>, break the loop and
      return a file-not-found error to the standby server. This ends
      recovery and the standby will then come up as a normal
--- 1402,1423 ----
      server. Normal recovery processing would request a file from the
      WAL archive, reporting failure if the file was unavailable.
      For standby processing it is normal for the next file to be
!     unavailable, so we must be patient and wait for it to appear.
!    </para>
!
!    <para>
!     A waiting <varname>restore_command</> can be written as a custom
      script that loops after polling for the existence of the next WAL
!     file. The script will occasionally be invoked with a request for
!     a file other than a WAL file. Such a request can be identified
!     by a file name which is anything except 24 hexadecimal characters.
!     If the requested file is available, it should be copied; otherwise
!     the script should return a file-not-found error. This does not
!     always terminate recovery mode.
!    </para>
!
!    <para>
!     There must also be some way to trigger failover, which should
      interrupt the <varname>restore_command</>, break the loop and
      return a file-not-found error to the standby server. This ends
      recovery and the standby will then come up as a normal
***************
*** 1415,1430 ****
     <para>
      Pseudocode for a suitable <varname>restore_command</> is:
  <programlisting>
! triggered = false;
! while (!NextWALFileReady() && !triggered)
! {
!     sleep(100000L);         /* wait for ~0.1 sec */
!     if (CheckForExternalTrigger())
!         triggered = true;
! }
! if (!triggered)
!     CopyWALFileForRecovery();
  </programlisting>
     </para>

     <para>
--- 1427,1443 ----
     <para>
      Pseudocode for a suitable <varname>restore_command</> is:
  <programlisting>
! if (RequestedFileIsWALFile())
!     while (!NextWALFileReady() && !CheckForExternalTrigger())
!         sleep(pollingInterval);
! return CopyFileForRecovery();
  </programlisting>
+     where CopyFileForRecovery() will return a file-not-found error when
+     the file doesn't exist, and zero for success when the copy succeeds.
+     If a large number is returned, the recovery process will interpret
+     this as indicating catastrophic failure, and will terminate the
+     standby instance of <productname>PostgreSQL</productname>, rather
+     than switching to a "ready" state.
     </para>

     <para>
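For readers who want something more concrete than the pseudocode in the patch, here is a minimal Python sketch of the same logic. It is not the script anyone in this thread actually runs. The archive directory, the trigger file path, the polling interval, the function names, and the choice of 1 as the "file not found" exit status are all assumptions made for illustration.

```python
import os
import re
import shutil
import time

# Hypothetical locations and settings; adjust for a real installation.
ARCHIVE_DIR = "/var/lib/pgsql/wal_archive"
TRIGGER_FILE = "/var/lib/pgsql/failover.trigger"
POLL_SECONDS = 1.0

# Per the patch text: a WAL file name is exactly 24 hexadecimal characters.
WAL_NAME = re.compile(r"[0-9A-Fa-f]{24}")


def is_wal_file(name):
    """Return True if the requested name looks like a WAL segment."""
    return WAL_NAME.fullmatch(name) is not None


def restore(name, dest, archive_dir=ARCHIVE_DIR,
            trigger_file=TRIGGER_FILE, poll_seconds=POLL_SECONDS):
    """Return 0 if the file was copied, 1 (assumed 'not found') otherwise."""
    src = os.path.join(archive_dir, name)
    if is_wal_file(name):
        # Only WAL segments are worth waiting for; any other request
        # (for example a .history file) is answered immediately.
        while not os.path.exists(src) and not os.path.exists(trigger_file):
            time.sleep(poll_seconds)
    if not os.path.exists(src):
        # Small nonzero status: the file is missing or failover was triggered.
        return 1
    shutil.copyfile(src, dest)
    return 0
```

A small wrapper that passes the requested file name and destination path (the %f and %p placeholders of restore_command) to restore() and exits with its return value would complete the script. Creating the trigger file, for instance with touch, ends the wait for the next segment.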
On Fri, 2007-06-29 at 17:04 -0500, Kevin Grittner wrote:
> I'm submitting this patch in an attempt to clarify some issues with the
> warm standby documentation which caused some confusion in our
> organization and which have been recently discussed on the admin list.
>
> I apologize for posting twice, but I noticed that my pseudo-code
> was unnecessarily long and complex, and the two return points might
> offend some people.

Looks OK, but this isn't specific enough.

This confusion was one of the reasons I wrote contrib/pg_standby, at
least to illustrate the handling of the files.

Another confusion you may encounter is that if you copy the files as
soon as they are available, the files may not yet be fully written, and
so an incomplete file may be copied into place.

I'll add more to the docs for the 8.3 changes.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
>>> On Mon, Jul 2, 2007 at 7:02 AM, in message <1183377737.10968.7.camel@silverbirch.site>, "Simon Riggs" <simon@2ndquadrant.com> wrote:
> On Fri, 2007-06-29 at 17:04 -0500, Kevin Grittner wrote:
>> I'm submitting this patch in an attempt to clarify some issues with the
>> warm standby documentation which caused some confusion in our
>> organization and which have been recently discussed on the admin list.
>
> Looks OK, but this isn't specific enough.
>
> This confusion was one of the reasons I wrote contrib/pg_standby, at
> least to illustrate the handling of the files.
>
> Another confusion you may encounter is that if you copy the files as
> soon as they are available the files may not yet be fully written and so
> an incomplete file may be copied into place.

Yeah, a note about copying to a modified form of the name and then
moving it to the specified name should be in there. Sorry I missed
that when I wrote up the patch.

While you're in there, please fix this redundancy: "and zero for
success when the copy succeeds." Thanks.

pg_standby looks interesting, but is major overkill for us -- the
script to do it directly is less code than the configuration of
pg_standby would be for us, with "fewer moving parts" to manage, so
don't neglect those in our position. Of course, we're using it more to
validate our backups and provide failover on the timescale of a few
minutes after we've attempted to resolve problems, so it's no big deal
to use touch or echo to create the file to trigger the switch to
production mode. And it's nice having one ten-line bash script to
handle 70 warm standbys on the machine.

This technique has already caught one corruption of a WAL file moving
across our WAN, allowing us to grab it again (uncorrupted) from its
initial copy location on the LAN of origin. Also, the activation of an
alternative server within a few minutes is a huge improvement over what
we had with the commercial software we're switching from, which was an
hour or two. After confirming that the remote site had indeed suffered
an unrecoverable failure, we would have to load a backup centrally,
apply all the database's transaction files, then "top it off" with
transactions from our applications transaction repository (to get up to
the last second). The whole process should be a matter of a few minutes
with the PostgreSQL warm standby capabilities.

-Kevin
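The "copy to a modified form of the name, then move it to the specified name" approach mentioned in this message could look roughly like the Python sketch below. The function name and the ".partial" suffix are hypothetical, and the sketch assumes the temporary and final names live in the same directory on the same filesystem, so the final rename replaces the name in a single step.

```python
import os
import shutil


def copy_then_rename(src, dest_dir, name, suffix=".partial"):
    """Copy src into dest_dir under a temporary name, then rename it.

    A script polling dest_dir for `name` never sees a half-written file,
    because the final name only appears once the copy has finished.
    """
    final_path = os.path.join(dest_dir, name)
    temp_path = final_path + suffix  # hypothetical temporary name
    shutil.copyfile(src, temp_path)
    # os.replace renames the finished copy onto the final name.
    os.replace(temp_path, final_path)
    return final_path
```

Because the temporary name is not 24 hexadecimal characters, a waiting restore script that looks only for the exact WAL file name will ignore the in-progress copy.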