On Tue, Mar 1, 2011 at 5:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> What if fast shutdown is requested while RecordTransactionCommit
>> >> is waiting in SyncRepWaitForLSN? ISTM fast shutdown cannot complete
>> >> until replication has been successfully done (i.e., until at least one
>> >> synchronous standby has connected to the master especially if
>> >> allow_standalone_primary is disabled). Is this OK?
>> >
>> > A "behaviour" - important, though needs further discussion.
>>
>> One of the scenarios which I'm concerned is:
>>
>> 1. The primary is running with allow_standalone_primary = on.
>> 2. While some backends are waiting for replication, the user requests
>> fast shutdown.
>> 3. Since the timeout expires, those backends stop waiting and return the success
>> indication to the client (but not replicated to the standby).
>> 4. Since there is no backend waiting for replication, fast shutdown completes.
>> 5. The clusterware like pacemaker detects the death of the primary and
>> triggers the
>> failover.
>> 6. New primary doesn't have some transactions committed to the client, i.e.,
>> transaction lost happens!!
>>
>> To avoid such a transaction lost, we should prevent the primary from
>> returning the
>> success indication to the client while fast shutdown is being executed, even if
>> allow_standalone_primary is enabled, I think. Thought?
>
> Walsenders don't shutdown until after they have sent the shutdown
> checkpoint.
>
> We could make them wait for the reply from the standby at that point.
Right. But what if the replication connection is closed before #3?
In this case, obviously walsender cannot send WAL up to the
shutdown checkpoint.
> I'll think some more, not convinced yet.
Thanks! I'll think more, too, to make sync rep more reliable!
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center