Thread: archive_command failures report confusing exit status

archive_command failures report confusing exit status

From
Peter Eisentraut
Date:
Failures of archive_command calls report a confusing exit status such as:

LOG:  archive command "cp 'pg_xlog/000000010000000000000000' '/nonexistent/000000010000000000000000'" failed: return
code256
 

The actual return code is 1; it neglects to apply WEXITSTATUS().

I figured it would make sense if pgarch.c used the same mechanism that
postmaster.c uses to report the various variants of regular and signal
exits.

I have attached a patch in that direction.  It obviously needs a bit of
string struggling to get all the cases right, but the idea should be
clear.  Comments?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: archive_command failures report confusing exit status

From
Alvaro Herrera
Date:
Peter Eisentraut wrote:

> I figured it would make sense if pgarch.c used the same mechanism that
> postmaster.c uses to report the various variants of regular and signal
> exits.

Hmm.  Getting rid of the "(PID 0)" is going to be a mess enough for
translations that I think it is worth pgarch.c having its own routine
for this.  Furthermore I think the detailed archive command should be
reported in an errdetail() field, which makes it even farther off.

(Hmm, there is nearly duplicate code in pclose_check already).

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Nunca confiaré en un traidor.  Ni siquiera si el traidor lo he creado yo"
(Barón Vladimir Harkonnen)


Re: archive_command failures report confusing exit status

From
Simon Riggs
Date:
On Mon, 2007-12-10 at 18:27 +0100, Peter Eisentraut wrote:
> Failures of archive_command calls report a confusing exit status such as:
> 
> LOG:  archive command "cp 'pg_xlog/000000010000000000000000' '/nonexistent/000000010000000000000000'" failed: return
code256
 
> 
> The actual return code is 1; it neglects to apply WEXITSTATUS().

Yes, you're right.

> I figured it would make sense if pgarch.c used the same mechanism that
> postmaster.c uses to report the various variants of regular and signal
> exits.

OK, you could sell me on that. How will you get rid of (PID = 0)?

Maybe it would be better to do that on two lines:
LOG: archive command "............." failed
FATAL: archive_command was terminated by signal ... 

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: archive_command failures report confusing exit status

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Peter Eisentraut wrote:
>> I figured it would make sense if pgarch.c used the same mechanism that
>> postmaster.c uses to report the various variants of regular and signal
>> exits.

> Hmm.  Getting rid of the "(PID 0)" is going to be a mess enough for
> translations that I think it is worth pgarch.c having its own routine
> for this.  Furthermore I think the detailed archive command should be
> reported in an errdetail() field, which makes it even farther off.

I agree with Alvaro --- trying to make LogChildExit serve two masters
will be uglier and less maintainable than having two copies of the
not-really-so-complex logic involved.  Leave postmaster.c alone and
just make pgarch.c smarter.

(But having said that, +1 for improving the message.)
        regards, tom lane


Re: archive_command failures report confusing exit status

From
Peter Eisentraut
Date:
Am Montag, 10. Dezember 2007 schrieb Alvaro Herrera:
> Peter Eisentraut wrote:
> > I figured it would make sense if pgarch.c used the same mechanism that
> > postmaster.c uses to report the various variants of regular and signal
> > exits.
>
> Hmm.  Getting rid of the "(PID 0)" is going to be a mess enough for
> translations that I think it is worth pgarch.c having its own routine
> for this.  Furthermore I think the detailed archive command should be
> reported in an errdetail() field, which makes it even farther off.

Better patch.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: archive_command failures report confusing exit status

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 17:27 +0100, Peter Eisentraut wrote:
> Am Montag, 10. Dezember 2007 schrieb Alvaro Herrera:
> > Peter Eisentraut wrote:
> > > I figured it would make sense if pgarch.c used the same mechanism that
> > > postmaster.c uses to report the various variants of regular and signal
> > > exits.
> >
> > Hmm.  Getting rid of the "(PID 0)" is going to be a mess enough for
> > translations that I think it is worth pgarch.c having its own routine
> > for this.  Furthermore I think the detailed archive command should be
> > reported in an errdetail() field, which makes it even farther off.
> 
> Better patch.

Looks much easier to understand.

I prefer archive_command = ... for the errdetail though. The commands
can be quite long so the extra words don't really add anything, plus
they require translation.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: archive_command failures report confusing exit status

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Better patch.

Doesn't this patch break the behavior that is documented in the comment?
Specifically, the case where the restore_command dies on a signal and
this is reported to us by the controlling shell as exitcode > 128.
We want the archiver to die, but this patch makes it not do so.
        regards, tom lane


Re: archive_command failures report confusing exit status

From
Peter Eisentraut
Date:
Am Dienstag, 11. Dezember 2007 schrieb Tom Lane:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Better patch.
>
> Doesn't this patch break the behavior that is documented in the comment?
> Specifically, the case where the restore_command dies on a signal and
> this is reported to us by the controlling shell as exitcode > 128.
> We want the archiver to die, but this patch makes it not do so.

AFAICT, the coding

WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128

is simply redundant, because a signal happened exactly when WIFSIGNALED(rc) is 
true.

I have tested this:

LOG:  database system was shut down at 2007-12-11 17:15:43 CET
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
FATAL:  archive command was terminated by signal 1: Hangup
DETAIL:  The archive command was "kill -1 $$".
LOG:  archiver process (PID 22572) exited with exit code 1

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: archive_command failures report confusing exit status

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Am Dienstag, 11. Dezember 2007 schrieb Tom Lane:
>> Doesn't this patch break the behavior that is documented in the comment?
>> Specifically, the case where the restore_command dies on a signal and
>> this is reported to us by the controlling shell as exitcode > 128.
>> We want the archiver to die, but this patch makes it not do so.

> AFAICT, the coding

> WIFSIGNALED(rc) || WEXITSTATUS(rc) > 128

> is simply redundant, because a signal happened exactly when WIFSIGNALED(rc) is 
> true.

No, you are confusing the cases "called shell was killed by a signal"
and "called command was killed by a signal, which the shell then turned
around and reported to us as exit > 128".

> I have tested this:

A single test case proves little.  You need to consider race conditions
and cases where the shell is ignoring the particular signal.  I'm fairly
certain that both of these exit statuses can happen in practice, at
least with some shells/platforms.
        regards, tom lane


Re: archive_command failures report confusing exit status

From
Peter Eisentraut
Date:
Simon Riggs wrote:
> I prefer archive_command = ... for the errdetail though. The commands
> can be quite long so the extra words don't really add anything, plus
> they require translation.

I did like this suggestion, but then I noticed, we don't actually report the 
setting of the archive_command setting but the actual command that was 
executed, with the placeholders filled out.  So I think the way I posted it 
is more correct.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: archive_command failures report confusing exit status

From
Peter Eisentraut
Date:
Tom Lane wrote:
> No, you are confusing the cases "called shell was killed by a signal"
> and "called command was killed by a signal, which the shell then turned
> around and reported to us as exit > 128".

Yes, I had missed that difference.  Next try ...

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: archive_command failures report confusing exit status

From
Simon Riggs
Date:
On Tue, 2007-12-11 at 23:31 +0100, Peter Eisentraut wrote:
> Simon Riggs wrote:
> > I prefer archive_command = ... for the errdetail though. The commands
> > can be quite long so the extra words don't really add anything, plus
> > they require translation.
> 
> I did like this suggestion, but then I noticed, we don't actually report the 
> setting of the archive_command setting but the actual command that was 
> executed, with the placeholders filled out.  So I think the way I posted it 
> is more correct.

I think you should lose the "The", if nothing else. Most Postgres
messages I recall say "return code = xx" not "The return code ..."

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: archive_command failures report confusing exit status

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> On Tue, 2007-12-11 at 23:31 +0100, Peter Eisentraut wrote:
> > Simon Riggs wrote:
> > > I prefer archive_command = ... for the errdetail though. The commands
> > > can be quite long so the extra words don't really add anything, plus
> > > they require translation.
> > 
> > I did like this suggestion, but then I noticed, we don't actually report the 
> > setting of the archive_command setting but the actual command that was 
> > executed, with the placeholders filled out.  So I think the way I posted it 
> > is more correct.
> 
> I think you should lose the "The", if nothing else. Most Postgres
> messages I recall say "return code = xx" not "The return code ..."

Right, that's because they are in errmessage() and not errdetail().  The
guidelines are different.  (The reason they are different really escapes
me, but you have to note that (1) Peter invented them, and (2) this
grammatical difference always make sense when you read the logs.)

-- 
Alvaro Herrera                               http://www.PlanetPostgreSQL.org/
"There is evil in the world. There are dark, awful things. Occasionally, we get
a glimpse of them. But there are dark corners; horrors almost impossible to
imagine... even in our worst nightmares." (Van Helsing, Dracula A.D. 1972)


Re: archive_command failures report confusing exit status

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Yes, I had missed that difference.  Next try ...

Looks sane to me.
        regards, tom lane


Re: archive_command failures report confusing exit status

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Simon Riggs wrote:
>> I think you should lose the "The", if nothing else. Most Postgres
>> messages I recall say "return code = xx" not "The return code ..."

> Right, that's because they are in errmessage() and not errdetail().  The
> guidelines are different.

Right --- errdetail is supposed to be a complete sentence.  (The
rationale for that is explained in TFM.)  I do find the current phrasing
a bit awkward, though; what's bugging me is putting quotes around the
command string.  It seems highly likely that the command string will
itself contain quotes and thus that the added quotes will be confusing.
I wonder if it'd be OK to do
The failed archive command was: %s

which is stretching the style guidelines by omitting a trailing period,
but I think that might be justifiable in this context.
        regards, tom lane