Simon, Fujii, All:
While demoing HS/SR at SCALE, I ran into a problem which people are
likely to hit often when they first set up HS/SR. Here's the
sequence:
1) Set up a brand new master with an archive_command and archive_mode=on.
2) Start the master
3) Do a pg_start_backup()
4) Realize, based on log error messages, that I've misconfigured the
archive_command.
5) Attempt to shut down the master. Master tells me that pg_stop_backup
must be run in order to shut down.
6) Execute pg_stop_backup.
7) pg_stop_backup waits forever without ever stopping backup. Every 60
seconds, it gives me a helpful "still waiting" message, but at least in
the amount of time I was willing to wait (5 minutes), it never completed.
8) Do an immediate shutdown, as it's the only way I can get the database
unstuck.
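A minimal postgresql.conf sketch of the kind of setup that leads to this
state, assuming the misconfiguration is an archive destination that does
not exist. The path below is hypothetical and is not the command used in
the demo:

```
# postgresql.conf fragment (illustrative only)
archive_mode = on
# %p = path of the WAL file to archive, %f = its file name.
# /no/such/archive is a hypothetical directory that does not exist,
# so cp exits non-zero and every archive attempt fails.
archive_command = 'cp %p /no/such/archive/%f'
```

Any archive_command that always returns a non-zero exit status should
behave the same way for the purposes of steps 4 through 7.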
With some experimentation, the problem seems to occur when you have a
failing archive_command and a master which currently has no database
traffic; for example, if I did some database write activity (a createdb)
then pg_stop_backup would complete after about 60 seconds (which, btw,
is extremely annoying, but at least tolerable).
This issue is 100% reproducible.
--Josh Berkus