On Mon, Aug 17, 2015 at 10:59 PM, Maciek Sakrejda <maciek@heroku.com> wrote:
> hot_standby_feedback is on, vacuum_defer_cleanup_age is not set, and
> max_standby_{streaming,archive}_delay are set to -1.
>
>>If all of these are off/zero then this sounds like the standby replays
>>an exclusive lock which blocks a query running in the standby, then
>>hits a vacuum record in the WAL log which it stops replay because the
>>blocked query has an old enough snapshot to see the record being
>>cleaned up.
>
> Shouldn't the replay proceed once the query finishes (or is canceled),
> though?
>
But if the query is stuck waiting on a lock held by the replay then it
will never finish. I *think* what's happening is that the lock is
coming from a vacuum trying to truncate a commonly used table like
pg_statistic. I'm not exactly clear on the whole sequence of events;
it might depend on two vacuums happening concurrently.

If you cancel all the queries that are running on the standby, the xlog
replay should continue when you hit the one holding up the replay.

It may be a good idea to actually reify replay blocking as a
heavyweight lock on the transaction with the oldest xmin. That would
possibly let the deadlock detector kick in (though iirc the deadlock
detector might be disabled on replicas) or at least let people see
something that makes a bit more sense than just everything stopping
for unclear reasons.
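
To make the reasoning concrete, here is a toy model of the wait-for graph
I have in mind. It is only a sketch: the node names and the find_cycle
helper are made up for illustration and are not PostgreSQL internals. The
point is that a cycle only becomes detectable once the replay's wait on
the old snapshot is recorded as an edge, which is what a heavyweight lock
would do.

```python
# Toy wait-for graph. Keys wait for their values. All names here are
# hypothetical and chosen for illustration only.

def find_cycle(waits_for):
    """Return the nodes of a wait cycle as a list, or None if there is none."""
    for start in waits_for:
        path = [start]
        seen = {start}
        node = start
        while node in waits_for:
            node = waits_for[node]
            if node == start:
                return path          # walked back to where we began
            if node in seen:
                break                # a loop that does not include start
            seen.add(node)
            path.append(node)
    return None

# As described above: the standby query waits on a lock held by replay,
# but replay's wait on the query's snapshot is not a lock-manager edge.
without_reified_wait = {"standby_query": "wal_replay"}

# With replay blocking reified as a lock on the oldest-xmin transaction.
with_reified_wait = {
    "standby_query": "wal_replay",
    "wal_replay": "standby_query",
}

print(find_cycle(without_reified_wait))
print(find_cycle(with_reified_wait))
```

In the first graph nothing looks wrong to anything that only inspects
lock waits, which matches the "everything stops for unclear reasons"
symptom. In the second the cycle is visible, so a detector (or a person
looking at the lock table) would at least have something to go on.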
--
greg