Thread: [PATCH] Report exit code from external recovery commands properly

[PATCH] Report exit code from external recovery commands properly

From
Peter Eisentraut
Date:
When an external recovery command such as restore_command or
archive_cleanup_command fails, it just reports "return code 34567" or
something, but we have facilities to do decode this properly, so use
them.


Attachment

Re: [PATCH] Report exit code from external recovery commands properly

From
Peter Geoghegan
Date:
On Wed, Nov 13, 2013 at 3:42 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> When an external recovery command such as restore_command or
> archive_cleanup_command fails, it just reports "return code 34567" or
> something, but we have facilities to do decode this properly, so use
> them.

I think this is a very good idea, but you should go a bit further:
document the special relationship restore_command has to special
return codes. Currently, the documentation says:

"It is important that the archive command return zero exit status if
and only if it succeeded. Upon getting a zero result, PostgreSQL will
assume that the WAL segment file has been successfully archived, and
will remove or recycle it. However, a nonzero status tells PostgreSQL
that the file was not archived; it will try again periodically until
it succeeds."

Yes, this concerns archive_command (where return code values that are
non-zero *are* never distinguished), but nothing much is said about
the return code of restore_command specifically anywhere else, so it's
implied that it's exactly inverse to archive_command. In reality, some
special return codes have a significance to restore_command: they make
recovery abort, because they're taking as proxies for various failures
that it isn't sensible to continue recovery in the event of.

We're talking about the difference between recovery aborting, and
recovery having conceptually "reached the end of the WAL stream", so
it's very surprising that this isn't documented currently.

-- 
Peter Geoghegan



Re: [PATCH] Report exit code from external recovery commands properly

From
Peter Eisentraut
Date:
On Wed, 2013-11-13 at 19:14 -0800, Peter Geoghegan wrote:
> I think this is a very good idea, but you should go a bit further:
> document the special relationship restore_command has to special
> return codes.

How about this?


Attachment

Re: [PATCH] Report exit code from external recovery commands properly

From
Peter Geoghegan
Date:
On Sun, Nov 24, 2013 at 5:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> How about this?

Hmm. You say:

+        If the command returns a nonzero exit status then a warning log
+        message will be written.  An exception is that if the command was
+        terminated by a signal or an error by the shell (such as command not
+        found), a fatal error will be raised.

But in the case of the archiver, in contrast to the startup process,
this isn't really a big deal. It'll just pick up where it left off.
Whereas the reaper code shuts down the system if this happens in the
startup process. In my opinion that's a distinction that bears
emphasizing.


-- 
Peter Geoghegan



Re: [PATCH] Report exit code from external recovery commands properly

From
Peter Eisentraut
Date:
On Sat, 2013-11-30 at 15:56 -0800, Peter Geoghegan wrote:
> On Sun, Nov 24, 2013 at 5:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> > How about this?
> 
> Hmm. You say:
> 
> +        If the command returns a nonzero exit status then a warning log
> +        message will be written.  An exception is that if the command was
> +        terminated by a signal or an error by the shell (such as command not
> +        found), a fatal error will be raised.
> 
> But in the case of the archiver, in contrast to the startup process,
> this isn't really a big deal. It'll just pick up where it left off.
> Whereas the reaper code shuts down the system if this happens in the
> startup process. In my opinion that's a distinction that bears
> emphasizing.

That snippet you quote is about archive_cleanup_command.  My patch
doesn't touch archive_command at all.

The current documentation about archive_command contains
  <para>   It is important that the archive command return zero exit status if and   only if it succeeds.  Upon getting
azero result,   <productname>PostgreSQL</> will assume that the file has been   successfully archived, and will remove
orrecycle it.  However, a nonzero   status tells <productname>PostgreSQL</> that the file was not archived;   it will
tryagain periodically until it succeeds.  </para>
 

which I think addresses your point.




Re: [PATCH] Report exit code from external recovery commands properly

From
Peter Geoghegan
Date:
On Mon, Dec 2, 2013 at 6:23 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> which I think addresses your point.

Fair enough. Marked "ready for committer".


-- 
Peter Geoghegan