Thread: Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Stuart Bishop
Date:
On Wed, Sep 19, 2012 at 5:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Sep 17, 2012 at 7:28 PM,  <stuart@stuartbishop.net> wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      7546
>> Logged by:          Stuart Bishop
>> Email address:      stuart@stuartbishop.net
>> PostgreSQL version: 9.1.5
>> Operating system:   Ubuntu 12.10
>> Description:
>>
>> I have a primary and a hot standby using streaming replication. The hot
>> standby specifies 'hot_standby_feedback=on' with other replication settings
>> set to default.
>>
>> If a vacuum occurs on the primary while pg_dump is dumping a large table,
>> the pg_dump is cancelled, usually with the following error:
>>
>> ERROR:  canceling statement due to conflict with recovery
>> DETAIL:  User was holding shared buffer pin for too long.
>
> Maybe I'm missing something, but ISTM that hot_standby_feedback doesn't
> eliminate the query cancels caused by buffer pin lock which you encountered.
> It eliminates only the query cancels caused by cleanup of rows. So you might
> need to set max_standby_streaming_delay to -1, to avoid query cancels.

I have also seen the following (but only on production, not with my test case):

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User was holding a relation lock for too long.

If you are correct, it seems impossible with 9.1 to have reliable long
transactions on a hot standby without allowing the hot standby to lag
behind by the length of the longest transaction.

I was hoping when the docs state "this parameter can be used to
eliminate query cancels caused by cleanup records" that it would cover
all the background maintenance.

Disabling autovacuum during the backup window would be one work around.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Andres Freund
Date:
On Wednesday, September 19, 2012 10:06:03 AM Stuart Bishop wrote:
> On Wed, Sep 19, 2012 at 5:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Mon, Sep 17, 2012 at 7:28 PM,  <stuart@stuartbishop.net> wrote:
> >> The following bug has been logged on the website:
> >>
> >> Bug reference:      7546
> >> Logged by:          Stuart Bishop
> >> Email address:      stuart@stuartbishop.net
> >> PostgreSQL version: 9.1.5
> >> Operating system:   Ubuntu 12.10
> >> Description:
> >>
> >> I have a primary and a hot standby using streaming replication. The hot
> >> standby specifies 'hot_standby_feedback=on' with other replication
> >> settings set to default.
> >>
> >> If a vacuum occurs on the primary while pg_dump is dumping a large
> >> table, the pg_dump is cancelled, usually with the following error:
> >>
> >> ERROR:  canceling statement due to conflict with recovery
> >> DETAIL:  User was holding shared buffer pin for too long.
> >
> > Maybe I'm missing something, but ISTM that hot_standby_feedback doesn't
> > eliminate the query cancels caused by buffer pin lock which you
> > encountered. It eliminates only the query cancels caused by cleanup of
> > rows. So you might need to set max_standby_streaming_delay to -1, to
> > avoid query cancels.
I think the above is just a case of the backend waiting a long time to send
data out due to your rate limiting. Why it does that holding a buffer pin may
be worth investigating...

> I have also seen the following (but only on production, not with my test
> case):
>
> ERROR:  canceling statement due to conflict with recovery
> DETAIL:  User was holding a relation lock for too long.
>
> If you are correct, it seems impossible with 9.1 to have reliable long
> transactions on a hot standby without allowing the hot standby to lag
> behind by the length of the longest transaction.
Uh. I don't think thats a valid chain of thought, especially when it comes due
to conflicts on relation locks. Those primarily (only?) happen if you do
perform an action on the primary that requires an exlusive lock. Like ALTER
TABLE, TRUNCATE, CLUSTER, LOCK TABLE, ... There is not really much the standby
can do if gets a request for an exlusive lock other than either cancelling
everything that blocks acquiring such a lock or waiting.

Is it possible that you perform such commands on tables used on the standby?

> I was hoping when the docs state "this parameter can be used to
> eliminate query cancels caused by cleanup records" that it would cover
> all the background maintenance.
Relation lock conflicts shouldn't be caused by background maintenance.

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Stuart Bishop
Date:
On Wed, Sep 19, 2012 at 3:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> ERROR:  canceling statement due to conflict with recovery
>> >> DETAIL:  User was holding shared buffer pin for too long.

>> > Maybe I'm missing something, but ISTM that hot_standby_feedback doesn't
>> > eliminate the query cancels caused by buffer pin lock which you
>> > encountered. It eliminates only the query cancels caused by cleanup of
>> > rows. So you might need to set max_standby_streaming_delay to -1, to
>> > avoid query cancels.

> I think the above is just a case of the backend waiting a long time to send
> data out due to your rate limiting. Why it does that holding a buffer pin may
> be worth investigating...

I put the rate limiting in there to trigger the problem consistently
with a small data set. On the main production server I see this
problem, it normally fails dumping a table with nearly 2 billion rows
to local disk (pg_dump -Fc). The trick seems to be that vacuum needs
to start running on the primary after the dump starts on the hot
standby, and the vacuum complete before the dump finishes.

>> I have also seen the following (but only on production, not with my test
>> case):
>>
>> ERROR:  canceling statement due to conflict with recovery
>> DETAIL:  User was holding a relation lock for too long.
>>
>> If you are correct, it seems impossible with 9.1 to have reliable long
>> transactions on a hot standby without allowing the hot standby to lag
>> behind by the length of the longest transaction.

> Is it possible that you perform such commands on tables used on the standby?

Yes, it is quite possible CREATE INDEX CONCURRENTLY was run on the
primary during the backup window. I can't discount other commands
either, but if that is the case one of our ops needs a spanking. I've
only been able to reproduce the buffer pin error in isolation.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Andres Freund
Date:
On Wednesday, September 19, 2012 11:33:29 AM Stuart Bishop wrote:
> On Wed, Sep 19, 2012 at 3:38 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
> >> >> ERROR:  canceling statement due to conflict with recovery
> >> >> DETAIL:  User was holding shared buffer pin for too long.
> >> >
> >> > Maybe I'm missing something, but ISTM that hot_standby_feedback
> >> > doesn't eliminate the query cancels caused by buffer pin lock which
> >> > you encountered. It eliminates only the query cancels caused by
> >> > cleanup of rows. So you might need to set max_standby_streaming_delay
> >> > to -1, to avoid query cancels.
> >
> > I think the above is just a case of the backend waiting a long time to
> > send data out due to your rate limiting. Why it does that holding a
> > buffer pin may be worth investigating...
>
> I put the rate limiting in there to trigger the problem consistently
> with a small data set. On the main production server I see this
> problem, it normally fails dumping a table with nearly 2 billion rows
> to local disk (pg_dump -Fc). The trick seems to be that vacuum needs
> to start running on the primary after the dump starts on the hot
> standby, and the vacuum complete before the dump finishes.
You see buffer pin errors or you see the relation errors? Those arent the same.
So if you see no buffer pin errors but relation lock ones in production your
way to reproduce the issue isn't telling us very much ;)

> >> I have also seen the following (but only on production, not with my test
> >> case):
> >>
> >> ERROR:  canceling statement due to conflict with recovery
> >> DETAIL:  User was holding a relation lock for too long.
> >>
> >> If you are correct, it seems impossible with 9.1 to have reliable long
> >> transactions on a hot standby without allowing the hot standby to lag
> >> behind by the length of the longest transaction.
> >
> > Is it possible that you perform such commands on tables used on the
> > standby?
>
> Yes, it is quite possible CREATE INDEX CONCURRENTLY was run on the
> primary during the backup window. I can't discount other commands
> either, but if that is the case one of our ops needs a spanking. I've
> only been able to reproduce the buffer pin error in isolation.
Do you have DDL logging enabled so you could investigate that?

I really wonder why we don't bother logging the dbOid/relOid pair before we
cancel transactions due to lock conflicts. That should make investigating such
issues considerably easier...

Greetings,

Andres
--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Stuart Bishop
Date:
On Wed, Sep 19, 2012 at 6:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> You see buffer pin errors or you see the relation errors? Those arent the same.
> So if you see no buffer pin errors but relation lock ones in production your
> way to reproduce the issue isn't telling us very much ;)

On production I have seen both. I have only been able to confirm and
isolate the trigger of the buffer pin errors, and thus this bug
report. I mentioned the rarer relation lock error just in case it was
relevant.

> Do you have DDL logging enabled so you could investigate that?
>
> I really wonder why we don't bother logging the dbOid/relOid pair before we
> cancel transactions due to lock conflicts. That should make investigating such
> issues considerably easier...

I do have DDL logging enabled. I've waded through a two hour period on
the primary where the problem must have occurred and can see no DDL
except for the creation and dropping of temporary tables. My parser is
unfortunately grep and my eyeballs to filter out the noise.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: BUG #7546: Backups on hot standby cancelled despite hot_standby_feedback=on

From
Stuart Bishop
Date:
Yesterday's dump failed with a new error. pg_stat_replication tells me
replication has been connected for nearly a month, so no network
issues. I've confirmed hot_standby_feedback is still set to 'on':

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

As before, the only error I've been able to reproduce in isolation is
the buffer pin error I originally reported.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/