Thread: BUG #13559: WAL replay stuck after DROP VIEW

BUG #13559: WAL replay stuck after DROP VIEW

From
maciek@heroku.com
Date:
The following bug has been logged on the website:

Bug reference:      13559
Logged by:          Maciek Sakrejda
Email address:      maciek@heroku.com
PostgreSQL version: 9.4.1
Operating system:   Ubuntu 12.04
Description:

We had some code in production that automatically dropped and recreated
views periodically. This database also has a replica that serves some
moderately intensive queries (read: on the order of several minutes). This
generally this works fine, but we ran into an issue the other day where the
startup process on the replica was holding a bunch of AccessExclusive locks
on these views (presumably due to the DROP) and would not progress even
though there were no conflicting queries (there may very well have been
queries against these views at one point, but not not when I looked--all the
locks held by the startup process showed up as granted in pg_locks). This
resolved when we restarted the replica.

We're on 9.4.1, but skimming through the 9.4.2-through-9.4.4 release notes,
I don't see anything relevant.

Could this be an outstanding bug? For what it's worth, we've been running
the view drop / recreate for about 9 months, totalling probably ~240 drops /
creates and this is the first time we've run into an issue.

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Michael Paquier
Date:
On Tue, Aug 11, 2015 at 7:31 AM,  <maciek@heroku.com> wrote:
> We had some code in production that automatically dropped and recreated
> views periodically. This database also has a replica that serves some
> moderately intensive queries (read: on the order of several minutes). This
> generally this works fine, but we ran into an issue the other day where the
> startup process on the replica was holding a bunch of AccessExclusive locks
> on these views (presumably due to the DROP) and would not progress even
> though there were no conflicting queries (there may very well have been
> queries against these views at one point, but not not when I looked--all the
> locks held by the startup process showed up as granted in pg_locks). This
> resolved when we restarted the replica.

Yes, possible. Access exclusive lock is taken when dropping the
relation before removing it from heap.

> We're on 9.4.1, but skimming through the 9.4.2-through-9.4.4 release notes,
> I don't see anything relevant.
>
> Could this be an outstanding bug? For what it's worth, we've been running
> the view drop / recreate for about 9 months, totalling probably ~240 drops /
> creates and this is the first time we've run into an issue.

The closest fix post-9.4.4 fix in this area is this commit:
commit: bab959906911c97437f410a03b0346e6dd28d528
author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Sun, 2 Aug 2015 20:09:05 +0300
Fix race condition that lead to WALInsertLock deadlock with commit_delay.

But this fixed a problem on master where queries got stuck because of
the WAL insert lock patch that has been introduced in 9.4 because of a
race condition between two backends.

Do you have some remnants of the logs when this problem happened? Or
even better, a stack trace to understand where the startup process got
stucked? That's perhaps unrelated, but were you using commit_delay?

I have been playing with pgbench, creating and dropping views with a
couple of hundred connections on 9.4.1 but could not reproduce the
issue.
Regards,
--
Michael

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Maciek Sakrejda
Date:
On Mon, Aug 10, 2015 at 8:18 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> But this fixed a problem on master where queries got stuck because of
> the WAL insert lock patch that has been introduced in 9.4 because of a
> race condition between two backends.
>

Interesting, thanks for pointing this out.

Do you have some remnants of the logs when this problem happened? Or
> even better, a stack trace to understand where the startup process got
> stucked? That's perhaps unrelated, but were you using commit_delay?
>

We do have logs from the primary and the replica, but I can't share them
directly. Is there anything specific I should look for? Unfortunately, I
did not think to take a stack trace from the startup process. And
commit_delay is not set.

I have been playing with pgbench, creating and dropping views with a
> couple of hundred connections on 9.4.1 but could not reproduce the
> issue.
>

Thanks for trying it out. Were you running queries against these views on
the replica when dropping/recreating them on the primary? I was guessing
maybe the lock interplay there had something to do with it, although I
admit my understanding of these mechanisms is pretty superficial.

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Tom Lane
Date:
Maciek Sakrejda <maciek@heroku.com> writes:
> On Mon, Aug 10, 2015 at 8:18 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> But this fixed a problem on master where queries got stuck because of
>> the WAL insert lock patch that has been introduced in 9.4 because of a
>> race condition between two backends.

> Interesting, thanks for pointing this out.

FWIW, I doubt that that particular fix is related, because in a replica
server there are no processes performing WAL inserts.

A stack trace of the startup process would definitely be a good thing to
grab if/when you see this again.

And, of course, there is always the standard advice that you ought to
be running current release --- 9.4.4 not 9.4.1.  I do not recognize this
as something we've fixed recently, but that doesn't mean it isn't.

            regards, tom lane

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Michael Paquier
Date:
On Tue, Aug 11, 2015 at 1:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maciek Sakrejda <maciek@heroku.com> writes:
>> On Mon, Aug 10, 2015 at 8:18 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>> But this fixed a problem on master where queries got stuck because of
>>> the WAL insert lock patch that has been introduced in 9.4 because of a
>>> race condition between two backends.
>
>> Interesting, thanks for pointing this out.
>
> FWIW, I doubt that that particular fix is related, because in a replica
> server there are no processes performing WAL inserts.

Yes, I was more thinking about a problem where XLogFlush interacts
badly with heap_delete. But that's hard to guess anything without
looking at the backtrace and a reproducible test case...

> A stack trace of the startup process would definitely be a good thing to
> grab if/when you see this again.

Yep.
--
Michael

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Greg Stark
Date:
On Mon, Aug 10, 2015 at 11:31 PM,  <maciek@heroku.com> wrote:
> We had some code in production that automatically dropped and recreated
> views periodically. This database also has a replica that serves some
> moderately intensive queries (read: on the order of several minutes). This
> generally this works fine, but we ran into an issue the other day where the
> startup process on the replica was holding a bunch of AccessExclusive locks
> on these views (presumably due to the DROP) and would not progress even
> though there were no conflicting queries (there may very well have been
> queries against these views at one point, but not not when I looked--all the
> locks held by the startup process showed up as granted in pg_locks). This
> resolved when we restarted the replica.

Is hot_standby_feedback enabled? Or vacuum_defer_cleanup_age set? Is
max_standby_*_delay set?

If all of these are off/zero then this sounds like the standby replays
an exclusive lock which blocks a query running in the standby, then
hits a vacuum record in the WAL log which it stops replay because the
blocked query has an old enough snapshot to see the record being
cleaned up.

--
greg

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Maciek Sakrejda
Date:
hot_standby_feedback is on, vacuum_defer_cleanup_age is not set, and
max_standby_{streaming,archive}_delay are set to -1.

>If all of these are off/zero then this sounds like the standby replays
>an exclusive lock which blocks a query running in the standby, then
>hits a vacuum record in the WAL log which it stops replay because the
>blocked query has an old enough snapshot to see the record being
>cleaned up.

Shouldn't the replay proceed once the query finishes (or is canceled),
though?

Re: BUG #13559: WAL replay stuck after DROP VIEW

From
Greg Stark
Date:
On Mon, Aug 17, 2015 at 10:59 PM, Maciek Sakrejda <maciek@heroku.com> wrote:
> hot_standby_feedback is on, vacuum_defer_cleanup_age is not set, and
> max_standby_{streaming,archive}_delay are set to -1.
>
>>If all of these are off/zero then this sounds like the standby replays
>>an exclusive lock which blocks a query running in the standby, then
>>hits a vacuum record in the WAL log which it stops replay because the
>>blocked query has an old enough snapshot to see the record being
>>cleaned up.
>
> Shouldn't the replay proceed once the query finishes (or is canceled),
> though?
>

But if the query is stuck waiting on a lock held by the replay then it
never will finish. I *think* what's happening is that the lock is
coming from a vacuum trying to truncate a commonly used table like
pg_statistic. I'm not exactly clear on the whole sequence of events,
it might depend on two vacuums happening concurrently.

If you cancel all the queries that are running on the standby the xlog
replay should continue when you hit then one holding up the replay.

It may be a a good idea to actually reify replay blocking as a
heavyweight lock on the transaction with the oldest xmin. That would
possibly let the deadlock detector kick in (though iirc the deadlock
detector might be disabled on replicas) or at least let people see
something that makes a bit more sense than just everything stopping
for unclear reasons.

--
greg