Re: Sync Rep v19 - Mailing list pgsql-hackers

From: Fujii Masao
Subject: Re: Sync Rep v19
Date:
Msg-id: AANLkTi=b16crDtFNU+iaQ1TGw1dPB1c9c79fFuEJ09AQ@mail.gmail.com
In response to: Re: Sync Rep v19 (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> SIGTERM can be sent by pg_terminate_backend(). So we should check
>> whether shutdown is requested before emitting WARNING and closing
>> the connection. If it's not requested yet, I think that it's safe to return the
>> success indication to the client.
>
> I'm not sure if that matters. Nobody apart from the postmaster knows
> about a shutdown. All the other processes know is that they received
> SIGTERM, which as you say could have been a specific user action aimed
> at an individual process.
>
> We need a way to end the wait state explicitly, so it seems easier to
> make SIGTERM the initiating action, no matter how it is received.
>
> The alternative is to handle it this way:
> 1) set something in shared memory
> 2) set latch of all backends
> 3) have the backends read shared memory and then end the wait
>
> Who would do (1) and (2)? Not the backend (it's sleeping), not the
> postmaster (it's shm), nor a WALSender (it might not be there).
>
> Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
> why we need that? Might it introduce other issues?

On second thought...

I was totally wrong. Preventing the backend from returning the commit
acknowledgement when shutdown is requested doesn't help to avoid data
loss at all. Even without a shutdown, the following simple scenario
can cause data loss:

1. The replication connection is closed because of a network outage.
2. Though replication has not completed, the waiting backend is
   released when the timeout expires. Then it returns success to
   the client.
3. The primary crashes, and then the clusterware promotes the standby,
   which doesn't have the latest changes from the primary, to be the
   new primary. Data loss happens!

Stepping back, there are two kinds of data loss:

(A) Physical data loss
This is the case where the committed data can never be physically
retrieved. For example, if the storage of a standalone server gets
corrupted, we lose some data forever. To avoid this type of data
loss, we would have to choose the "wait-forever" behavior. But as I
said upthread, we can reduce the risk of this data loss to a certain
extent by spending more money on the storage. So, if that cost is
less than the cost of the down-time we would otherwise incur, we
don't need to choose the "wait-forever" option.

(B) Logical data loss
This is the case where we wrongly think that the committed data has
been lost, while we can actually still retrieve it physically. For
example, in the three-step scenario above, all the committed data can
physically be read from the two servers even after failover. But
since the client reads only from the new primary, some data looks
lost to the client. The "wait-forever" behavior also helps to avoid
this type of data loss. Another way is to STONITH the standby before
the timeout releases any waiting backend. Then we can completely
prevent the outdated standby from being promoted, and so avoid
logical data loss. According to my quick research, in DRBD the "dopd"
(DRBD outdate-peer daemon) plays that role.
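
Just to illustrate the ordering that matters here, a minimal sketch
(all function names are made up for illustration; this is not dopd's
or PostgreSQL's actual code): the standby must be fenced *before* any
waiting backend is released, and if fencing fails the primary must
keep waiting.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch of the fence-before-release rule; none of
 * these names come from PostgreSQL or DRBD. */

static bool
fence_standby(void)
{
    /* In DRBD, dopd marks the peer "outdated" so it cannot be
     * promoted; here it's just a stub that pretends to succeed. */
    printf("fencing (outdating) the standby...\n");
    return true;
}

static void
release_waiting_backends(void)
{
    printf("releasing backends waiting for sync rep\n");
}

int
main(void)
{
    /* On sync-rep timeout: fence first, release second. Releasing
     * first would leave a window where the stale standby could be
     * promoted, which is exactly the logical data loss above. */
    if (fence_standby())
        release_waiting_backends();
    else
        printf("fencing failed; keep waiting to avoid data loss\n");
    return 0;
}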

What I'd like to avoid is (B). Though (A) is a more serious problem
than (B), we already have some techniques to reduce the risk of (A);
we have none for (B), I think.

The "wait-forever" might be a straightforward approach against (B). But
this option prevents transactions from running not only when the
synchronous standby goes away, but also when the primary is invoked
first or when the standby is promoted at failover. Since the availability
of the database service decreases very much, I don't want to use that.

Keeping transactions waiting in the latter two cases would be
required to avoid (A), but not (B). So I think we can relax the
"wait-forever" option so that it allows not-yet-replicated
transactions to complete in just those cases. IOW, when we initially
start the primary, the backends don't wait at all for a new standby
to connect. And while a new primary is running alone after failover,
the backends don't wait either. Only when the replication connection
is closed while streaming WAL to a sync standby do the backends wait,
until a new sync standby has connected and replication has caught up.
Even in this case, if we want to improve the service availability, we
only have to make something like dopd STONITH the outdated standby
and then request the primary to release the waiting backends. So I
think that the interface to request that release should be
implemented.
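
To make the release protocol concrete, here is a minimal,
self-contained model of the "set something in shared memory, set the
latch of all backends, re-check and end the wait" sequence Simon
outlined above. It uses pthreads only as a stand-in; the real
implementation would use PostgreSQL's latch machinery (SetLatch /
WaitLatch) and shared memory, and every name below is illustrative,
not from the patch.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wakeup = PTHREAD_COND_INITIALIZER;
static bool cancel_waits = false;  /* the "something in shared memory" */

static void *
backend(void *arg)
{
    /* Each waiter sleeps until woken, then re-checks the flag. */
    pthread_mutex_lock(&lock);
    while (!cancel_waits)
        pthread_cond_wait(&wakeup, &lock);
    pthread_mutex_unlock(&lock);
    printf("backend %ld: wait ended, returning commit\n", (long) arg);
    return NULL;
}

int
main(void)
{
    pthread_t t[3];

    for (long i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, backend, (void *) i);

    /* The "release" interface: set the flag, then wake every waiter. */
    pthread_mutex_lock(&lock);
    cancel_waits = true;
    pthread_cond_broadcast(&wakeup);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

(Compile with "cc -pthread". The flag is set before the broadcast, so
a backend ends its wait only after observing the release request,
never spuriously.)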

Fortunately, that partial "wait-forever" behavior is already
implemented in Simon's patch with the client timeout = 0 (disabled).
If he implements the interface to release the waiting backends, I'm
OK with his design of when to release the backends for 9.1 (unless
I'm missing something).

Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

