Re: Sync Rep v19 - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Sync Rep v19
Date
Msg-id 1299277722.10703.7120.camel@ebony
In response to Re: Sync Rep v19  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Sync Rep v19
Re: Sync Rep v19
List pgsql-hackers
On Sat, 2011-03-05 at 05:04 +0900, Fujii Masao wrote:
> On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> SIGTERM can be sent by pg_terminate_backend(). So we should check
> >> whether shutdown is requested before emitting WARNING and closing
> >> the connection. If it's not requested yet, I think that it's safe to return the
> >> success indication to the client.
> >
> > I'm not sure if that matters. Nobody apart from the postmaster knows
> > about a shutdown. All the other processes know is that they received
> > SIGTERM, which as you say could have been a specific user action aimed
> > at an individual process.
> >
> > We need a way to end the wait state explicitly, so it seems easier to
> > make SIGTERM the initiating action, no matter how it is received.
> >
> > The alternative is to handle it this way
> > 1) set something in shared memory
> > 2) set latch of all backends
> > 3) have the backends read shared memory and then end the wait
> >
> > Who would do (1) and (2)? Not the backend, it's sleeping; not the
> > postmaster, since this is shared memory it shouldn't touch; nor a
> > WALSender, because it might not be there.
> >
> > Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
> > why we need that? Might it introduce other issues?
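
To make the comparison concrete, a rough sketch of that three-step alternative (illustrative only, not code from the patch: SyncRepCtl, cancel_all_waits, foreach_waiting_backend() and CommitIsReplicated() are hypothetical names, and MyProc->procLatch plus the three-argument WaitLatch() form follow the later latch API, which may not match the 9.1-era code exactly):

    /* (1) whoever decides to end the waits sets a flag in shared memory */
    SpinLockAcquire(&SyncRepCtl->mutex);        /* SyncRepCtl: hypothetical shared struct */
    SyncRepCtl->cancel_all_waits = true;
    SpinLockRelease(&SyncRepCtl->mutex);

    /* (2) ... and wakes every backend sleeping in the commit wait */
    foreach_waiting_backend(proc)               /* hypothetical iteration */
        SetLatch(&proc->procLatch);

    /* (3) each backend's wait loop re-checks shared memory on every wakeup */
    for (;;)
    {
        if (SyncRepCtl->cancel_all_waits || CommitIsReplicated())  /* hypothetical check */
            break;
        WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1L);
        ResetLatch(&MyProc->procLatch);
    }
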
>
> On second thought...
>
> I was totally wrong. Preventing the backend from acknowledging the commit
> when shutdown is requested doesn't help to avoid data loss at all. Even
> without a shutdown, the following simple scenario can cause data loss.
>
> 1. The replication connection is closed because of a network outage.
> 2. Though replication has not completed, the waiting backend is
>     released when the timeout expires. Then it returns success to
>     the client.
> 3. The primary crashes, and then the clusterware promotes the standby,
>     which doesn't have the latest changes from the primary, to new
>     primary. Data loss happens!

Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes us to decide to fail over, in case the standby is
actually up and this is some kind of split-brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we never acknowledged the commit to the client, we are OK to treat
that situation as if the transaction had aborted.

> In the first place, there are two kinds of data loss:
>
> (A) Physical data loss
> This is the case where we can never retrieve the committed data
> physically. For example, if the storage of a standalone server gets
> corrupted, we lose some data forever. To avoid this type of data
> loss, we would have to choose the "wait-forever" behavior. But, as I
> said upthread, we can reduce the risk of this data loss to a certain
> extent by spending more money on the storage. So, if that cost is
> less than the cost we have to pay when downtime happens, we don't
> need to choose the "wait-forever" option.
>
> (B) Logical data loss
> This is the case where we wrongly think that the committed data has
> been lost even though we can actually retrieve it physically. For
> example, in the above three-step scenario, all the committed data can
> still be read from the two servers physically even after failover. But
> since the client reads data only from the new primary, some data looks
> lost to the client. The "wait-forever" behavior also helps to avoid
> this type of data loss. Another way is to STONITH the standby before
> the timeout releases any waiting backend. That way we can completely
> prevent the outdated standby from being brought up, and can avoid
> logical data loss. According to my quick research, in DRBD the "dopd"
> (DRBD outdate-peer daemon) plays that role.
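
(Sketching the ordering that dopd-style approach implies, purely for
illustration -- none of these function names exist in the patch:

    if (SyncRepTimeoutExpired())           /* hypothetical */
    {
        if (FenceOutdatedStandby())        /* hypothetical STONITH/outdate hook */
            ReleaseAllSyncRepWaiters();    /* hypothetical release interface */
        /* else: keep the backends waiting; releasing now would risk (B) */
    }

The point is that fencing has to complete before any waiter is released.)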
>
> What I'd like to avoid is (B). Though (A) is a more serious problem
> than (B), we already have some techniques to decrease the risk of (A),
> but not of (B), I think.
>
> The "wait-forever" might be a straightforward approach against (B). But
> this option prevents transactions from running not only when the
> synchronous standby goes away, but also when the primary is invoked
> first or when the standby is promoted at failover. Since the availability
> of the database service decreases very much, I don't want to use that.
>
> Keeping transactions waiting in the latter two cases would be required
> to avoid (A), but not (B). So I think that we can relax the "wait-forever"
> option so that it allows non-replicated transactions to complete only in
> those cases. IOW, when we initially start the primary, the backends
> don't wait at all for a new standby to connect. And while a new primary
> is running alone after failover, the backends don't wait at all either.
> Only when the replication connection is closed while WAL is being
> streamed to a sync standby do the backends wait, until a new sync
> standby has connected and replication has caught up. Even in this case,
> if we want to improve service availability, we only have to make
> something like dopd STONITH the outdated standby and then request the
> primary to release the waiting backends. So I think that an interface
> to request that release should be implemented.
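
Put as a decision rule, that partial wait-forever behaviour amounts to
something like this (illustrative pseudo-C only, not the patch's code or
variable names):

    static bool
    ShouldWaitForSyncRep(void)
    {
        /*
         * Freshly started primary, or new primary running alone after
         * failover: no sync standby has attached yet, so don't wait.
         */
        if (!sync_standby_has_ever_connected)   /* hypothetical flag */
            return false;

        /*
         * A sync standby was streaming and its connection dropped: wait
         * until a new sync standby catches up, or until the operator
         * fences the old standby and explicitly requests release.
         */
        return true;
    }
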
>
> Fortunately, that partial "wait-forever" behavior is already
> implemented in Simon's patch when the client timeout is set to 0
> (disabled). If he implements the interface to release the waiting
> backends, I'm OK with his design of when to release the backends
> for 9.1 (unless I'm missing something).

Almost-working patch attached for the above feature. Time to stop for
the day. Patch against current repo version.

Current repo version also attached here (v20), which includes fixes for
all known technical issues, major polishing, etc.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services


