Re: Issues with Quorum Commit - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Issues with Quorum Commit
Date
Msg-id 1286332309.2025.3941.camel@ebony
In response to Re: Issues with Quorum Commit  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Issues with Quorum Commit
List pgsql-hackers
On Tue, 2010-10-05 at 18:52 -0700, Jeff Davis wrote:

> I'm not saying that an unavailable system is good, but I don't see how
> my particular complaint applies to the "wait for all servers to apply"
> case.

> The case I was worried about is:
>  * 1 master and 2 standby
>  * The rule is "wait for at least one standby to apply the WAL"
> 
> In your notation, I believe that's M -> { S1, S2 }
> 
> In that case, if S1 is just a little faster than S2, then S2 might
> build up a significant queue of unapplied WAL. Then, when S1 goes down,
> there's no way for the slower one to acknowledge a new transaction
> without playing through all of the unapplied WAL.

That situation would require two things:
* First, you have set up async replication and you're not monitoring it
properly. Shame on you.
* Second, you would have to request "apply" mode sync rep. If you had
requested "recv" or "fsync" mode, then the standby does *not* have to
have applied the WAL before acknowledgement.

Since the first problem is a generic problem with async replication, and
can already happen in 8.2+, it's not exactly an argument against a new
feature.
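
To make the second point concrete, here is a minimal sketch of how the
acknowledgement rule differs between modes. It is a toy model, not
PostgreSQL code: the Standby class, can_ack(), quorum_met() and the
integer WAL positions are all hypothetical names invented for
illustration. Only the mode names "recv", "fsync" and "apply" come from
the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    """Toy model of a standby; positions are abstract WAL offsets."""
    name: str
    received: int  # last WAL position received over the network
    fsynced: int   # last WAL position flushed to disk
    applied: int   # last WAL position replayed


def can_ack(standby: Standby, commit_pos: int, mode: str) -> bool:
    """Return True if the standby can acknowledge a commit at commit_pos.

    Assumption: each mode compares the commit position against a
    different progress marker on the standby.
    """
    progress = {
        "recv": standby.received,
        "fsync": standby.fsynced,
        "apply": standby.applied,
    }[mode]
    return progress >= commit_pos


def quorum_met(standbys, commit_pos: int, mode: str, needed: int = 1) -> bool:
    """Return True if at least `needed` standbys can acknowledge."""
    acks = sum(can_ack(s, commit_pos, mode) for s in standbys)
    return acks >= needed


# S1 is down, so only S2 remains. S2 has received and flushed all WAL
# but has a large replay backlog.
s2 = Standby("S2", received=1000, fsynced=1000, applied=400)
print(quorum_met([s2], 1000, "recv"))   # S2 can acknowledge
print(quorum_met([s2], 1000, "apply"))  # S2 must replay the backlog first
```

In this model the lagging standby only blocks commits when "apply" is
requested; with "recv" or "fsync" its replay backlog is irrelevant to
the acknowledgement.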

> Intuitively, the administrator would think that he was getting both HA
> and redundancy, but in reality the availability is no better than if
> there were only two servers (M -> S1), except that it might be faster to
> replay the WAL than to set up a new standby (but that's not guaranteed).

Not guaranteed, but it is very likely that the standby would not be that far
behind. If it gets too far behind it will likely blow out the disk space
on the standby and fail.

> I think you would call that a misconfiguration, and I would agree. 

Yes, regrettably there are various ways to misconfigure this. The above
is really a degeneration of the 2 standby case into the 1 standby case:
if you ask for 2 standbys and one of them is ineffective, then the
system acts like you have only one.

> I was
> just trying to point out a pitfall that I didn't see until I read Josh's
> email.

You mention that it cannot occur if we choose to lock up the master and
cause transactions to wait. That may be true in many cases. It does
still occur when we have transactions that generate a large amount of
WAL, such as loads, ALTER TABLEs, etc. In those cases, S2 could well fall far
behind S1 during those long transactions and if S1 goes down at that
point there would be a backlog to apply.  But again, this only applies
to "apply" mode sync rep.
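
As a rough back-of-the-envelope sketch of that backlog, assume WAL is
generated at a constant rate during the long transaction and S2 replays
at a constant, slower rate. The function names and the numbers below
are made up for illustration; real replay rates vary with workload.

```python
def backlog_after(duration_s: int, wal_rate: int, replay_rate: int) -> int:
    """Unapplied WAL (in MB) on a standby after duration_s seconds.

    wal_rate and replay_rate are in MB/s. The backlog only grows when
    WAL is generated faster than the standby can replay it.
    """
    return max(0, (wal_rate - replay_rate) * duration_s)


def catch_up_seconds(backlog_mb: int, replay_rate: int) -> float:
    """Seconds needed to replay the backlog.

    Assumes no further WAL arrives while the standby catches up.
    """
    return backlog_mb / replay_rate


# Hypothetical numbers: a 10-minute load generating 20 MB/s of WAL,
# while S2 replays at 15 MB/s.
backlog = backlog_after(600, wal_rate=20, replay_rate=15)
print(backlog)                        # MB of unapplied WAL on S2
print(catch_up_seconds(backlog, 15))  # delay before S2 can ack in "apply" mode
```

Under these assumptions, if S1 fails at the end of the load, the first
"apply" mode commit waits for S2 to work through the whole backlog.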

So it can occur in both cases, though it now looks to me that it's less
important an issue in either case. So I think this doesn't rate the term
dangerous to describe it any longer.

Thanks for your careful thought and analysis on this.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


