Hi,
On 2015-02-23 15:25:57 +0000, Thom Brown wrote:
> I've noticed that if the primary is started and then a base backup is
> immediately taken from it and started as a synchronous standby, it
> doesn't replicate and the primary hangs indefinitely when trying to run any
> WAL-generating statements. It only recovers when either the primary is
> restarted (which has to use a fast shutdown otherwise it also hangs
> forever), or the standby is restarted.
>
> Here's a way of reproducing it:
> ...
> Note that if you run the commands one by one, there isn't a problem. If
> you run it as a script, the standby doesn't connect to the primary. There
> aren't any errors reported by either the standby or the primary. The
> primary's wal sender process reports the following:
>
> wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
>
> Anyone know why this would be happening? And if this could be a problem in
> other scenarios?
Given that a walsender normally doesn't wait for syncrep, I guess the
backend above has only just done authentication. If you gdb into the
walsender, what's the backtrace?
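
In case it helps, here is a minimal sketch of grabbing that backtrace
non-interactively. The helper names are made up for illustration. It
assumes gdb is installed, that you are allowed to attach to the process
(same OS user as the server, or root), and that you took the walsender's
pid from ps:

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints a backtrace, and exits.

    Hypothetical helper: -p attaches to a running process, -batch makes gdb
    exit once its commands have run, -ex bt runs the backtrace command.
    """
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "bt"]


def walsender_backtrace(pid):
    """Run gdb against pid and return everything it printed as one string.

    Assumption: gdb is on PATH and the caller may attach to the process.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is still worth reading
    )
    return result.stdout + result.stderr
```

Calling walsender_backtrace() with the pid of the "startup waiting"
process should show whether it is sitting in the syncrep wait.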
We previously had discussions about that being rather annoying; I
unfortunately don't remember enough of the thread to reference it
here. If it really is this, I think we should add some more smarts about
only enabling syncrep once a backend is fully up and maybe even remove
it from more scenarios during commits generally (e.g. if no xid was
assigned and we just pruned something).
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services