Actual RC of "restore_command" is relevant for DB startup - Mailing list pgsql-docs

From Gunnar "Nick" Bluth
Subject Actual RC of "restore_command" is relevant for DB startup
Msg-id 57177C46.6040604@elster.de
List pgsql-docs
Hello,

I've just stumbled across a certain oddity with "restore_command" while
setting up a fresh environment with segmented (i.e., firewalled) networks.

I configured the restore_command as found in the PGBARMan docs (using
ssh) and was a bit stunned that after a restart, I saw this in the logs:

2016-04-20 13:22:45 CEST [3788]: [2-1] db=,user= FATAL:  could not
restore file "00000002.history" from archive: child process exited with
exit code 255
2016-04-20 13:22:45 CEST [3786]: [3-1] db=,user= LOG:  startup process
(PID 3788) exited with exit code 1
2016-04-20 13:22:45 CEST [3786]: [4-1] db=,user= LOG:  aborting startup
due to startup process failure

Which was obviously caused by
ssh: connect to host <archive server> port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.0]

Now, the firewall does not let ssh through (yet), so the root cause is
quite obvious.

However, the docs[1] only state that:
"(...) if the command was terminated by a signal (other than SIGTERM,
which is used as part of a database server shutdown) or an error by the
shell (such as command not found), then recovery will abort and the
server will not start up."

In [2], Kevin Grittner stated that it might be that the command's RC
should be <= 255, otherwise it will be assessed as "failed badly; give up".

And indeed, after amending the restore_command with a "|| exit 1", the
server starts up just fine, using replication to fetch the missing WALs.
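
For illustration, the workaround might look roughly like the sketch below.
The host name archive.example.com, the user barman and the archive path are
placeholders (the actual configuration is not shown in this mail), and
clamp_rc is a hypothetical helper, not part of PostgreSQL or Barman. The
idea is only to collapse every failure of the fetch command to exit status
1, so that the 255 which ssh returns on a connection error never reaches
the server.

```shell
# Inline form, as a restore_command setting (placeholder host and path):
#   restore_command = 'rsync -a barman@archive.example.com:/path/to/wals/%f %p || exit 1'

# Hypothetical wrapper doing the same thing for an arbitrary command:
# returns 0 on success and 1 on any failure, whatever the original status was.
clamp_rc() {
    "$@" && return 0
    return 1
}

# Example: a command failing with status 255 is reported as 1.
clamp_rc sh -c 'exit 255' || echo "exit status: $?"
```

Note that this also hides genuine "command not found" errors behind a plain
exit status of 1, so it trades one kind of surprise for another.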

This is ok for me right now as a workaround. However, had I found this
not while setting everything up from scratch but during a disaster
(or simply during a downtime or very high load on the archive server while
restarting a slave), this (basically undocumented!) behavior would have
caused me quite a headache!

I reckon few users will expect a connection timeout to fall into
the category of "command not found"...

Maybe the part "error by the shell (such as command not found)" could be
changed to "error by the shell (RC > 254, e.g. command not found or ssh
connection failure)" (or rather to whatever the real behavior is; I didn't
check the sources)?



[1] http://www.postgresql.org/docs/current/static/archive-recovery-settings.html
[2] http://stackoverflow.com/questions/10524458/postgresql-9-1-streaming-replication-restore-command-special-meaning-of-exit-co


Best regards,
--
Gunnar "Nick" Bluth
DBA ELSTER

Tel:   +49 911/991-4665
Mobil: +49 172/8853339
