Thread: Incomplete docs for restore_command for hot standby

Incomplete docs for restore_command for hot standby

From
"Markus Bertheau"
Date:
(I sent this to -docs already, but it didn't get through for some reason.)

From the current 8.3 docs:

Section 24.3.3.1 states about restore_command:

"The command will be asked for file names that are not present in the
archive; it must return nonzero when so asked."

Section 24.4.1 further states:

"The magic that makes the two loosely coupled servers work together is
simply a restore_command used on the standby that waits for the next
WAL file to become available from the primary."

It is not clear from the first paragraph whether the non-existent
file that restore_command is being asked for is a not-yet-generated
WAL file or something different. If it is a not-yet-generated WAL
file, a restore_command for replication has to wait for it to
appear. If it is something different, a restore_command for replication
has to return an error right away (otherwise it would hang
indefinitely, waiting for a file that is never going to appear). Yet I
couldn't find any hint in the documentation as to how restore_command
can detect these two cases, i.e. how it should tell a request for a
WAL file from a request for a non-WAL file.

Practice (http://archives.postgresql.org/sydpug/2006-10/msg00001.php)
shows that this is a problem, and that people fall back on unproven
heuristics (such as looking for the substring 'history' in the
requested file name).
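
To illustrate the distinction being asked about, requested names can be
told apart by their overall shape rather than by a substring search. The
following is only a minimal Python sketch under stated assumptions, not
text from the documentation or code from any PostgreSQL tool: the
function name classify_request and its category labels are made up for
this example, and the patterns assume the usual naming, i.e. 24
upper-case hex digits for a WAL segment, 8 hex digits plus ".history"
for a timeline history file, and a WAL segment name followed by
".<8 hex digits>.backup" for a backup history file.

```python
import re

# Assumed naming conventions for files kept in a WAL archive.
WAL_SEGMENT = re.compile(r"[0-9A-F]{24}")
TIMELINE_HISTORY = re.compile(r"[0-9A-F]{8}\.history")
BACKUP_HISTORY = re.compile(r"[0-9A-F]{24}\.[0-9A-F]{8}\.backup")


def classify_request(file_name: str) -> str:
    """Classify a file name passed to restore_command (the %f value)."""
    # fullmatch() requires the whole name to fit the pattern, so a WAL
    # segment name with an extra suffix is not mistaken for a segment.
    if WAL_SEGMENT.fullmatch(file_name):
        return "wal"
    if TIMELINE_HISTORY.fullmatch(file_name):
        return "timeline_history"
    if BACKUP_HISTORY.fullmatch(file_name):
        return "backup_history"
    return "other"
```

Under these assumptions a standby script would wait only when the result
is "wal" and would return nonzero immediately for every other missing
file.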

Additionally, 24.3.3 contains slightly misleading information:

"It is important that the command return nonzero exit status on
failure. The command will be asked for log files that are not present
in the archive; it must return nonzero when so asked. This is not an
error condition."

This suggests that all non-existent files that restore_command will be
asked for are log files. One could therefore reasonably assume that a
restore_command for replication should wait on all non-existent files.
24.3.3.1 later corrects this by stating that not only log files may be
requested, but the earlier wording is misleading nevertheless.
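
To make the quoted requirement concrete for plain archive recovery
(no waiting), here is a minimal Python sketch. It is an illustration
only: restore_once and its parameter names are hypothetical, with
file_name standing in for %f and dest_path for %p. A missing file of any
kind yields a nonzero status, which the server treats as "not in the
archive" rather than as an error.

```python
import os
import shutil


def restore_once(archive_dir: str, file_name: str, dest_path: str) -> int:
    """Copy one requested file out of the archive, without waiting.

    Returns 0 on success and 1 when the file is not in the archive.
    """
    source = os.path.join(archive_dir, file_name)
    if not os.path.isfile(source):
        # Applies to WAL files and non-WAL files alike.
        return 1
    # dest_path is used exactly as given; its base name is not assumed
    # to match file_name.
    shutil.copyfile(source, dest_path)
    return 0
```

This behaves like the cp example from the documentation: it never waits,
which is what a non-standby recovery wants.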

Markus

Re: Incomplete docs for restore_command for hot standby

From
Simon Riggs
Date:
On Thu, 2008-02-21 at 08:01 +0600, Markus Bertheau wrote:
> (I sent this to -docs already, but it didn't get through for some reason.)
>
> From the current 8.3 docs:
>
> Section 24.3.3.1 states about restore_command:
>
> "The command will be asked for file names that are not present in the
> archive; it must return nonzero when so asked."
>
> Section 24.4.1 further states:
>
> "The magic that makes the two loosely coupled servers work together is
> simply a restore_command used on the standby that waits for the next
> WAL file to become available from the primary."
>
> It is not clear from the first paragraph, whether the non-existing
> file that restore_command is being asked for is a not-yet-generated
> WAL file or something different. If it was a not-yet-generated WAL
> file, restore_command for replication would have to wait for it to
> appear. If it was something different, restore_command for replication
> would have to return an error right away. (Because else it would hang
> indefinitely, waiting for a file that is not going to appear). Yet I
> couldn't find hints in the documentation as to how these two cases can
> be detected by restore_command, i.e. how restore_command should tell a
> request for a WAL file from a request for a non-WAL file.

The two sentences aren't mutually exclusive, especially when you
consider they are discussing two different use cases. Why not read up on
pg_standby anyway?

> Practice (http://archives.postgresql.org/sydpug/2006-10/msg00001.php)
> shows that this is a problem, and people use unproved heuristics
> ('history' substring in the requested file name).

Old email written during beta. Read at your own peril.

> Additionally, 24.3.3 contains slightly misleading information:
>
> "It is important that the command return nonzero exit status on
> failure. The command will be asked for log files that are not present
> in the archive; it must return nonzero when so asked. This is not an
> error condition."
>
> This suggests that all non-existing files that restore_command will be
> asked for are log files. One could therefore reasonably assume that
> restore_command for replication should wait on all non-existing files.
> 24.3.3.1 later corrects this by stating that not only log files may be
> requested, but nevertheless.

If you have some suggested changes, I'd be happy to hear them.

Probably additions are better than just changes though.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

Re: Incomplete docs for restore_command for hot standby

From
"Markus Bertheau"
Date:
2008/2/22, Simon Riggs <simon@2ndquadrant.com>:
> On Thu, 2008-02-21 at 08:01 +0600, Markus Bertheau wrote:
>  >
>  > Section 24.3.3.1 states about restore_command:
>  >
>  > "The command will be asked for file names that are not present in the
>  > archive; it must return nonzero when so asked."
>  >
>  > Section 24.4.1 further states:
>  >
>  > "The magic that makes the two loosely coupled servers work together is
>  > simply a restore_command used on the standby that waits for the next
>  > WAL file to become available from the primary."
>  >
>  > It is not clear from the first paragraph, whether the non-existing
>  > file that restore_command is being asked for is a not-yet-generated
>  > WAL file or something different. If it was a not-yet-generated WAL
>  > file, restore_command for replication would have to wait for it to
>  > appear. If it was something different, restore_command for replication
>  > would have to return an error right away. (Because else it would hang
>  > indefinitely, waiting for a file that is not going to appear). Yet I
>  > couldn't find hints in the documentation as to how these two cases can
>  > be detected by restore_command, i.e. how restore_command should tell a
>  > request for a WAL file from a request for a non-WAL file.
>
>
> The two sentences aren't mutually exclusive, especially when you
>  consider they are discussing two different use cases. Why not read up on
>  pg_standby anyway?

I have read about pg_standby, but this is not about solving a particular
problem; it is about information that is missing from the docs.

>  > Practice (http://archives.postgresql.org/sydpug/2006-10/msg00001.php)
>  > shows that this is a problem, and people use unproved heuristics
>  > ('history' substring in the requested file name).
>
>
> Old email written during beta. Read at your own peril.

The email may be old, but the problem at hand is still relevant.

>  > Additionally, 24.3.3 contains slightly misleading information:
>  >
>  > "It is important that the command return nonzero exit status on
>  > failure. The command will be asked for log files that are not present
>  > in the archive; it must return nonzero when so asked. This is not an
>  > error condition."
>  >
>  > This suggests that all non-existing files that restore_command will be
>  > asked for are log files. One could therefore reasonably assume that
>  > restore_command for replication should wait on all non-existing files.
>  > 24.3.3.1 later corrects this by stating that not only log files may be
>  > requested, but nevertheless.
>
>
> If you have some suggested changes, I'd be happy to hear them.
>
>  Probably additions are better than just changes though.

What about this:

*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 1001,1011 **** restore_command = 'cp /mnt/server/archivedir/%f %p'

     <para>
      It is important that the command return nonzero exit status on failure.
!     The command <emphasis>will</> be asked for log files that are not present
!     in the archive; it must return nonzero when so asked.  This is not an
!     error condition.  Be aware also that the base name of the <literal>%p</>
!     path will be different from <literal>%f</>; do not expect them to be
!     interchangeable.
     </para>

     <para>
--- 1001,1011 ----

     <para>
      It is important that the command return nonzero exit status on failure.
!     The command <emphasis>will</> be asked for log and other files that are
!     not present in the archive; it must return nonzero when so asked.  This is
!     not an error condition.  Be aware also that the base name of the
!     <literal>%p</> path will be different from <literal>%f</>; do not expect
!     them to be interchangeable.
     </para>

     <para>
***************
*** 1576,1594 **** archive_command = 'local_backup_script.sh'

     <para>
      The magic that makes the two loosely coupled servers work together is
!     simply a <varname>restore_command</> used on the standby that waits
!     for the next WAL file to become available from the primary. The
!     <varname>restore_command</> is specified in the
      <filename>recovery.conf</> file on the standby server. Normal recovery
      processing would request a file from the WAL archive, reporting failure
      if the file was unavailable.  For standby processing it is normal for
!     the next file to be unavailable, so we must be patient and wait for
!     it to appear. A waiting <varname>restore_command</> can be written as
!     a custom script that loops after polling for the existence of the next
!     WAL file. There must also be some way to trigger failover, which should
!     interrupt the <varname>restore_command</>, break the loop and return
!     a file-not-found error to the standby server. This ends recovery and
!     the standby will then come up as a normal server.
     </para>

     <para>
--- 1576,1596 ----

     <para>
      The magic that makes the two loosely coupled servers work together is
!     simply a <varname>restore_command</> used on the standby that, when asked
!     for the a WAL file, waits for it to become available from the primary.
!     The <varname>restore_command</> is specified in the
      <filename>recovery.conf</> file on the standby server. Normal recovery
      processing would request a file from the WAL archive, reporting failure
      if the file was unavailable.  For standby processing it is normal for
!     the next WAL file to be unavailable, so we must be patient and wait for
!     it to appear. For non-WAL files though the script must still report
!     failure. WAL files can be distinguished from non-WAL files by FIXME. A
!     waiting <varname>restore_command</> can be written as a custom script that
!     loops after polling for the existence of the next WAL file. There must
!     also be some way to trigger failover, which should interrupt the
!     <varname>restore_command</>, break the loop and return a file-not-found
!     error to the standby server. This ends recovery and the standby will then
!     come up as a normal server.
     </para>

     <para>

The FIXME of course needs replacement by someone in the know.
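
As an illustration of the behaviour the proposed paragraph describes,
here is a minimal Python sketch of a waiting restore step. It is not the
committed wording and not how pg_standby is implemented. The names
waiting_restore, needs_wait, trigger_path and poll_seconds are
hypothetical, and the sketch assumes that non-WAL requests can be
recognised by a ".history" or ".backup" suffix, which is exactly the
detail the FIXME leaves open.

```python
import os
import shutil
import time


def needs_wait(file_name: str) -> bool:
    """Assumption: only names without these suffixes are WAL segments."""
    return not file_name.endswith((".history", ".backup"))


def waiting_restore(archive_dir: str, file_name: str, dest_path: str,
                    trigger_path: str, poll_seconds: float = 1.0) -> int:
    """Return 0 once the file has been copied, 1 for file-not-found."""
    source = os.path.join(archive_dir, file_name)
    while True:
        if os.path.exists(source):
            shutil.copyfile(source, dest_path)
            return 0
        if not needs_wait(file_name):
            # A missing non-WAL file will never appear: fail right away.
            return 1
        if os.path.exists(trigger_path):
            # Failover was requested: stop waiting and report
            # file-not-found so that recovery ends.
            return 1
        time.sleep(poll_seconds)
```

A real script would also have to make sure that a WAL file is completely
written before copying it, which this sketch does not attempt.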

Markus Bertheau
Blog: http://www.bluetwanger.de/blog/

Re: [PATCHES] Incomplete docs for restore_command for hot standby

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Markus Bertheau wrote:
> 2008/2/22, Simon Riggs <simon@2ndquadrant.com>:
> > On Thu, 2008-02-21 at 08:01 +0600, Markus Bertheau wrote:
> >  >
> >  > Section 24.3.3.1 states about restore_command:
> >  >
> >  > "The command will be asked for file names that are not present in the
> >  > archive; it must return nonzero when so asked."
> >  >
> >  > Section 24.4.1 further states:
> >  >
> >  > "The magic that makes the two loosely coupled servers work together is
> >  > simply a restore_command used on the standby that waits for the next
> >  > WAL file to become available from the primary."
> >  >
> >  > It is not clear from the first paragraph, whether the non-existing
> >  > file that restore_command is being asked for is a not-yet-generated
> >  > WAL file or something different. If it was a not-yet-generated WAL
> >  > file, restore_command for replication would have to wait for it to
> >  > appear. If it was something different, restore_command for replication
> >  > would have to return an error right away. (Because else it would hang
> >  > indefinitely, waiting for a file that is not going to appear). Yet I
> >  > couldn't find hints in the documentation as to how these two cases can
> >  > be detected by restore_command, i.e. how restore_command should tell a
> >  > request for a WAL file from a request for a non-WAL file.
> >
> >
> > The two sentences aren't mutually exclusive, especially when you
> >  consider they are discussing two different use cases. Why not read up on
> >  pg_standby anyway?
>
> I read about pg_standby, but this is not about solving a particular problem but
> about missing information in the docs.
>
> >  > Practice (http://archives.postgresql.org/sydpug/2006-10/msg00001.php)
> >  > shows that this is a problem, and people use unproved heuristics
> >  > ('history' substring in the requested file name).
> >
> >
> > Old email written during beta. Read at your own peril.
>
> The email may be old, but the problem at hand is still relevant.
>
> >  > Additionally, 24.3.3 contains slightly misleading information:
> >  >
> >  > "It is important that the command return nonzero exit status on
> >  > failure. The command will be asked for log files that are not present
> >  > in the archive; it must return nonzero when so asked. This is not an
> >  > error condition."
> >  >
> >  > This suggests that all non-existing files that restore_command will be
> >  > asked for are log files. One could therefore reasonably assume that
> >  > restore_command for replication should wait on all non-existing files.
> >  > 24.3.3.1 later corrects this by stating that not only log files may be
> >  > requested, but nevertheless.
> >
> >
> > If you have some suggested changes, I'd be happy to hear them.
> >
> >  Probably additions are better than just changes though.
>
> What about this:
>
> *** a/doc/src/sgml/backup.sgml
> --- b/doc/src/sgml/backup.sgml
> ***************
> *** 1001,1011 **** restore_command = 'cp /mnt/server/archivedir/%f %p'
>
>      <para>
>       It is important that the command return nonzero exit status on failure.
> !     The command <emphasis>will</> be asked for log files that are not present
> !     in the archive; it must return nonzero when so asked.  This is not an
> !     error condition.  Be aware also that the base name of the <literal>%p</>
> !     path will be different from <literal>%f</>; do not expect them to be
> !     interchangeable.
>      </para>
>
>      <para>
> --- 1001,1011 ----
>
>      <para>
>       It is important that the command return nonzero exit status on failure.
> !     The command <emphasis>will</> be asked for log and other files that are
> !     not present in the archive; it must return nonzero when so asked.  This is
> !     not an error condition.  Be aware also that the base name of the
> !     <literal>%p</> path will be different from <literal>%f</>; do not expect
> !     them to be interchangeable.
>      </para>
>
>      <para>
> ***************
> *** 1576,1594 **** archive_command = 'local_backup_script.sh'
>
>      <para>
>       The magic that makes the two loosely coupled servers work together is
> !     simply a <varname>restore_command</> used on the standby that waits
> !     for the next WAL file to become available from the primary. The
> !     <varname>restore_command</> is specified in the
>       <filename>recovery.conf</> file on the standby server. Normal recovery
>       processing would request a file from the WAL archive, reporting failure
>       if the file was unavailable.  For standby processing it is normal for
> !     the next file to be unavailable, so we must be patient and wait for
> !     it to appear. A waiting <varname>restore_command</> can be written as
> !     a custom script that loops after polling for the existence of the next
> !     WAL file. There must also be some way to trigger failover, which should
> !     interrupt the <varname>restore_command</>, break the loop and return
> !     a file-not-found error to the standby server. This ends recovery and
> !     the standby will then come up as a normal server.
>      </para>
>
>      <para>
> --- 1576,1596 ----
>
>      <para>
>       The magic that makes the two loosely coupled servers work together is
> !     simply a <varname>restore_command</> used on the standby that, when asked
> !     for the a WAL file, waits for it to become available from the primary.
> !     The <varname>restore_command</> is specified in the
>       <filename>recovery.conf</> file on the standby server. Normal recovery
>       processing would request a file from the WAL archive, reporting failure
>       if the file was unavailable.  For standby processing it is normal for
> !     the next WAL file to be unavailable, so we must be patient and wait for
> !     it to appear. For non-WAL files though the script must still report
> !     failure. WAL files can be distinguished from non-WAL files by FIXME. A
> !     waiting <varname>restore_command</> can be written as a custom script that
> !     loops after polling for the existence of the next WAL file. There must
> !     also be some way to trigger failover, which should interrupt the
> !     <varname>restore_command</>, break the loop and return a file-not-found
> !     error to the standby server. This ends recovery and the standby will then
> !     come up as a normal server.
>      </para>
>
>      <para>
>
> The FIXME of course needs replacement by someone in the know.
>
> Markus Bertheau
> Blog: http://www.bluetwanger.de/blog/

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [PATCHES] Incomplete docs for restore_command for hot standby

From
Simon Riggs
Date:
On Mon, 2008-02-25 at 17:56 +0600, Markus Bertheau wrote:
> 2008/2/22, Simon Riggs <simon@2ndquadrant.com>:

> > If you have some suggested changes, I'd be happy to hear them.
> >
> >  Probably additions are better than just changes though.
>
> What about this:
>
> *** a/doc/src/sgml/backup.sgml
> --- b/doc/src/sgml/backup.sgml
> ***************

...

> The FIXME of course needs replacement by someone in the know.

Doc patch edited to include all of Markus' points, tidy up some related
text and fix typos.

Good to apply to HEAD.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk

Attachment

Re: [PATCHES] Incomplete docs for restore_command for hot standby

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2008-02-25 at 17:56 +0600, Markus Bertheau wrote:
>> The FIXME of course needs replacement by someone in the know.
>
> Doc patch edited to include all of Markus' points, tidy up some related
> text and fix typos.
>
> Good to apply to HEAD.

Committed to HEAD with minor fixes.

What's our policy wrt. back-patching doc changes? This seems applicable
to older versions as well, but do we do that?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [PATCHES] Incomplete docs for restore_command for hot standby

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2008-02-25 at 17:56 +0600, Markus Bertheau wrote:
> >> The FIXME of course needs replacement by someone in the know.
> >
> > Doc patch edited to include all of Markus' points, tidy up some related
> > text and fix typos.
> >
> > Good to apply to HEAD.
>
> Committed to HEAD with minor fixes.
>
> What's our policy wrt. back-patching doc changes? This seems applicable
> to older versions as well, but do we do that?

I do back-patch doc changes if the change is serious.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +