Re: Issues with Quorum Commit - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Issues with Quorum Commit
Date
Msg-id 1286313540.2025.2923.camel@ebony
In response to Re: Issues with Quorum Commit  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Issues with Quorum Commit
List pgsql-hackers
On Tue, 2010-10-05 at 13:45 -0700, Jeff Davis wrote:
> On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to
> > be indefinitely ahead of any standby which is not needed by the quorum.
> >  This means that:
> > 
> > -- There is no clear criteria for when a standby which is not needed for
> > quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some
> > specific window of synchronicity, eliminating a lot of the value of
> > quorum commit.
> 
> Point B seems particularly dangerous.
> 
> When you lose one of the systems and the lagging server becomes required
> for quorum, then all of a sudden you could be facing a huge delay to
> commit the next transaction (because it needs to catch up on a lot of
> WAL replay). This can happen even without a network problem at all, and
> seems very likely to result in the lagging system being considered
> "down" due to a timeout. Not good, because the reason it is required for
> quorum is because another standby just went down.
> 
> In other words, a lagging standby combined with a timeout mechanism is
> essentially useless, because it will never catch up in time to be a part
> of the quorum.

Thanks for explaining what was meant.

This issue is a serious problem for the apply-to-*all*-servers
configuration that Heikki has been describing as a useful use case. We
register a standby, it goes down and we decide to wait for it. Then when
it does come back up it takes ages to catch up.
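
To put rough numbers on "takes ages", here is a minimal sketch of the
catch-up arithmetic. It assumes a constant replay rate on the standby and a
constant WAL generation rate on the master; the function names and the
figures in the usage note are hypothetical and are not taken from any
PostgreSQL code.

```python
def catch_up_seconds(backlog_bytes: float, replay_rate: float, wal_rate: float) -> float:
    """Estimated time for a returning standby to work off its WAL backlog.

    backlog_bytes: WAL the standby still has to receive and replay
    replay_rate:   bytes/second the standby can replay (assumed constant)
    wal_rate:      bytes/second of new WAL the master keeps generating
    """
    if replay_rate <= wal_rate:
        # The standby falls further behind (or holds level) and never catches up.
        return float("inf")
    return backlog_bytes / (replay_rate - wal_rate)


def catches_up_within(backlog_bytes: float, replay_rate: float,
                      wal_rate: float, timeout_s: float) -> bool:
    """True if the standby would be caught up before a timeout declares it down."""
    return catch_up_seconds(backlog_bytes, replay_rate, wal_rate) <= timeout_s
```

For example, with a 10 GB backlog, 50 MB/s replay and 10 MB/s of new WAL,
the estimate is 256 seconds, so a 30 second timeout would give up on the
standby long before it is caught up.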

This is really the nail in the coffin for the "All" servers use case,
and a significant blow to the requirement for standby registration.

If we use N+1 redundancy as I have explained, then this situation does
not occur until you have fewer than N standbys available. But then it's
no surprise that RAID-5 won't work with 4 drives either.
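
The difference between waiting for N of N+1 standbys and waiting for all of
them can be sketched as follows. This is only an illustration under stated
assumptions: each registered standby reports an integer flush position, or
None while it is down, and the names quorum_lsn and can_ack are
hypothetical, not anything in the sync rep patches.

```python
from typing import Dict, Optional


def quorum_lsn(flushed: Dict[str, Optional[int]], n: int) -> Optional[int]:
    """Highest WAL position reached by at least n connected standbys.

    Returns None when fewer than n standbys are connected, in which case
    no commit can be acknowledged.
    """
    positions = sorted((p for p in flushed.values() if p is not None), reverse=True)
    if len(positions) < n:
        return None
    # The n-th highest position is the furthest point n standbys have all reached.
    return positions[n - 1]


def can_ack(commit_lsn: int, flushed: Dict[str, Optional[int]], n: int) -> bool:
    """True if a commit at commit_lsn has been confirmed by n standbys."""
    q = quorum_lsn(flushed, n)
    return q is not None and q >= commit_lsn


# Three registered standbys, one of them down (hypothetical positions).
standbys = {"s1": 1000, "s2": 1000, "s3": None}

n_of_n_plus_1 = can_ack(1000, standbys, n=2)   # quorum of 2 out of 3
all_servers = can_ack(1000, standbys, n=3)     # "all" mode waits for s3

# s3 comes back far behind: "all" mode is still stuck until it catches up.
standbys_after_return = {"s1": 1000, "s2": 1000, "s3": 400}
all_after_return = can_ack(1000, standbys_after_return, n=3)
```

With these values the quorum of 2 keeps acknowledging commits while s3 is
down, whereas the "all" setting blocks both while s3 is down and after it
returns at position 400. The quorum case only blocks once fewer than 2
standbys are connected.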

-- Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


