Thread: Re: [GENERAL] [streaming replication] 9.1.3 streaming replication bug ?

Re: [GENERAL] [streaming replication] 9.1.3 streaming replication bug ?

From: Fujii Masao
Date: 11 April 2012, 15:56:50

On Wed, Apr 11, 2012 at 3:31 PM, 乔志强 <qiaozhiqiang@leadcoretech.com> wrote:
> So in sync streaming replication, if master delete WAL before sent to the
> only standby, all transaction will fail forever,
> "the master tries to avoid a PANIC error rather than termination of
> replication." but in sync replication, termination of replication is THE
> bigger PANIC error.

I see your point. When there are backends waiting for replication, the WAL files
which the standby might not have received yet must not be removed. If they are
removed, replication keeps failing forever because the required WAL files no
longer exist on the master, and the waiting backends will never be released
unless the replication mode is changed to async. This should be avoided.

To fix this issue, we should prevent the master from deleting any WAL file that
contains the minimum waiting LSN or a later position. I'll think about it some
more and implement the patch.
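
To illustrate the retention rule described above, here is a rough sketch, not
the actual patch. It assumes an LSN can be modeled as a plain byte offset and
that WAL segments have the default 16 MB size; the real LSN and file-name
layout in 9.1 is more involved. The names segment_of and removable_segments
are hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size in bytes


def segment_of(lsn: int) -> int:
    """Return the number of the segment that contains byte position `lsn`."""
    return lsn // WAL_SEGMENT_SIZE


def removable_segments(existing, checkpoint_redo_lsn, waiting_lsns):
    """Return the segments that can be recycled without stranding a waiter.

    existing            -- iterable of segment numbers currently on disk
    checkpoint_redo_lsn -- oldest LSN still needed for crash recovery
    waiting_lsns        -- commit LSNs that backends are waiting on
    """
    keep_from = segment_of(checkpoint_redo_lsn)
    if waiting_lsns:
        # Retain the segment holding the minimum waiting LSN and all later ones.
        keep_from = min(keep_from, segment_of(min(waiting_lsns)))
    return sorted(seg for seg in existing if seg < keep_from)
```

With no waiting backends the function falls back to the checkpoint-based
limit alone, which corresponds to the behavior being complained about.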

Regards,

--
Fujii Masao

Re: [GENERAL] [streaming replication] 9.1.3 streaming replication bug ?

From: Michael Nolan
Date: 11 April 2012, 16:10:08

On 4/11/12, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 11, 2012 at 3:31 PM, 乔志强 <qiaozhiqiang@leadcoretech.com> wrote:
>> So in sync streaming replication, if master delete WAL before sent to the
>> only standby, all transaction will fail forever,
>> "the master tries to avoid a PANIC error rather than termination of
>> replication." but in sync replication, termination of replication is THE
>> bigger PANIC error.
>
> I see your point. When there are backends waiting for replication, the WAL
> files which the standby might not have received yet must not be removed. If
> they are removed, replication keeps failing forever because required WAL
> files don't exist in the master, and then waiting backends will never be
> released unless replication mode is changed to async. This should be avoided.
>
> To fix this issue, we should prevent the master from deleting the WAL files
> including the minimum waiting LSN or bigger ones. I'll think more and
> implement the patch.

With asynchronous replication, does the master even know if a slave
fails because of a WAL problem?  And does/should it care?

Isn't there a separate issue with synchronous replication?  If it
fails, what's the appropriate action to take on the master?  PANICing
it seems to be a bad idea, but having transactions never complete
because they never hear back from the synchronous slave (for whatever
reason) seems bad too.
--
Mike Nolan

Re: [GENERAL] [streaming replication] 9.1.3 streaming replication bug ?

From: Fujii Masao
Date: 11 April 2012, 16:36:11

On Thu, Apr 12, 2012 at 12:56 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Apr 11, 2012 at 3:31 PM, 乔志强 <qiaozhiqiang@leadcoretech.com> wrote:
>> So in sync streaming replication, if master delete WAL before sent to the
>> only standby, all transaction will fail forever,
>> "the master tries to avoid a PANIC error rather than termination of
>> replication." but in sync replication, termination of replication is THE
>> bigger PANIC error.
>
> I see your point. When there are backends waiting for replication, the WAL files
> which the standby might not have received yet must not be removed. If they are
> removed, replication keeps failing forever because required WAL files don't
> exist in the master, and then waiting backends will never be released unless
> replication mode is changed to async. This should be avoided.

On second thought, we can avoid the issue simply by setting
wal_keep_segments high enough. Even if the issue happens and some backends
get stuck waiting for replication, we can release them by taking a fresh
backup and restarting the standby from that backup. This is the basic
procedure for restarting replication after it has been terminated because
required WAL files were removed from the master. So this issue might not be
worth a patch for now (though I'm not against improving things in the
future); it seems to be just a matter of tuning wal_keep_segments.
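
As a rough aid for that tuning, the sketch below estimates a
wal_keep_segments value from how fast the master writes WAL and how long a
standby is allowed to fall behind. The function name, the safety factor, and
the input figures are assumptions made for illustration; only the default
16 MB segment size comes from PostgreSQL itself.

```python
import math

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size in bytes


def estimate_wal_keep_segments(wal_bytes_per_sec: float,
                               max_standby_lag_sec: float,
                               safety_factor: float = 2.0) -> int:
    """Estimate a wal_keep_segments value that covers a standby outage."""
    # WAL produced while the standby is away, padded by the safety factor.
    needed_bytes = wal_bytes_per_sec * max_standby_lag_sec * safety_factor
    return math.ceil(needed_bytes / WAL_SEGMENT_SIZE)


# Example: 2 MB/s of WAL, standby may be away for up to 30 minutes.
print(estimate_wal_keep_segments(2 * 1024 * 1024, 30 * 60))  # prints 450
```

A value of 450 means the master sets aside roughly 7 GB of disk for old
segments, which is the cost of this approach.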

Regards,

--
Fujii Masao

Re: [GENERAL] [streaming replication] 9.1.3 streaming replication bug ?

From: Robert Haas
Date: 11 April 2012, 18:14:26

On Wed, Apr 11, 2012 at 12:35 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Apr 12, 2012 at 12:56 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Apr 11, 2012 at 3:31 PM, 乔志强 <qiaozhiqiang@leadcoretech.com> wrote:
>>> So in sync streaming replication, if master delete WAL before sent to the
>>> only standby, all transaction will fail forever,
>>> "the master tries to avoid a PANIC error rather than termination of
>>> replication." but in sync replication, termination of replication is THE
>>> bigger PANIC error.
>>
>> I see your point. When there are backends waiting for replication, the WAL files
>> which the standby might not have received yet must not be removed. If they are
>> removed, replication keeps failing forever because required WAL files don't
>> exist in the master, and then waiting backends will never be released unless
>> replication mode is changed to async. This should be avoided.
>
> On second thought, we can avoid the issue by just increasing
> wal_keep_segments enough. Even if the issue happens and some backends
> get stuck to wait for replication, we can release them by taking fresh backup
> and restarting the standby from that backup. This is the basic procedure to
> restart replication after replication is terminated because required WAL files
> are removed from the master. So this issue might not be worth implementing
> the patch for now (though I'm not against improving things in the future), but
> it seems just a tuning-problem of wal_keep_segments.

We've talked about teaching the master to keep track of how far back
all of its known standbys are, and retaining WAL back to that specific
point, rather than the shotgun approach that is wal_keep_segments.
It's not exactly clear what the interface to that should look like,
though.
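
One possible shape for such tracking, purely as a sketch: the master records
the last position each known standby has confirmed and retains WAL back to
the slowest one. The class and method names are hypothetical, LSNs are
modeled as plain byte offsets, and how standbys would be registered and
forgotten is exactly the open interface question.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size in bytes


class StandbyTracker:
    """Remember the last WAL position each known standby has confirmed."""

    def __init__(self):
        self._confirmed = {}  # standby name -> last confirmed LSN (byte offset)

    def report(self, name: str, lsn: int) -> None:
        # A standby's confirmed position never moves backwards.
        self._confirmed[name] = max(lsn, self._confirmed.get(name, 0))

    def forget(self, name: str) -> None:
        # Dropping a standby releases the WAL that was kept only for it.
        self._confirmed.pop(name, None)

    def oldest_needed_segment(self, current_lsn: int) -> int:
        """First segment to retain: the one holding the slowest standby's LSN."""
        oldest = min(self._confirmed.values(), default=current_lsn)
        return oldest // WAL_SEGMENT_SIZE
```

Unlike a fixed segment count, the retained range here grows without bound
while a known standby stays disconnected, so something like forget() would be
needed to cap disk usage on the master.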

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company