Re: Synchronous replication - patch status inquiry - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Synchronous replication - patch status inquiry
Date
Msg-id 1283373781.1834.290.camel@ebony
In response to Re: Synchronous replication - patch status inquiry  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Wed, 2010-09-01 at 13:23 +0300, Heikki Linnakangas wrote:
> On 01/09/10 10:53, Fujii Masao wrote:
> > Before discussing about that, we should determine whether registering
> > standbys in master is really required. It affects configuration a lot.
> > Heikki thinks that it's required, but I'm still unclear about why and
> > how.
> >
> > Why do standbys need to be registered in master? What information
> > should be registered?
> 
> That requirement falls out from the handling of disconnected standbys. 
> If a standby is not connected, what does the master do with commits? If 
> the answer is anything else than acknowledge them to the client 
> immediately, as if the standby never existed, the master needs to know 
> what standby servers exist. Otherwise it can't know if all the standbys 
> are connected or not.

"All the standbys" presupposes that we know what they are, i.e. we have
registered them, so I see that argument as circular. Quorum commit does
not need registration, so quorum commit is the "easy to implement"
option and registration is the more complex later feature. I don't have
a problem with adding registration later and believe it can be done
later without issues.
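
To illustrate why quorum commit needs no registration, here is a minimal
sketch in Python (for brevity, not backend C). It assumes "quorum" simply
means "at least N currently connected standbys have acknowledged the commit
LSN"; the class and method names are hypothetical and not taken from any
patch.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumCommit:
    """Treat a commit as safe once `quorum` standbys have acknowledged it."""

    quorum: int
    # Highest LSN acknowledged by each currently connected standby, keyed by
    # an opaque connection id. Nothing is registered ahead of time.
    acked: dict = field(default_factory=dict)

    def connect(self, conn_id):
        # A standby becomes known only at the moment it connects.
        self.acked[conn_id] = 0

    def disconnect(self, conn_id):
        # A disconnected standby simply stops counting towards the quorum.
        self.acked.pop(conn_id, None)

    def ack(self, conn_id, lsn):
        # Acknowledgements never move backwards.
        self.acked[conn_id] = max(self.acked.get(conn_id, 0), lsn)

    def is_committed(self, commit_lsn):
        replies = sum(1 for lsn in self.acked.values() if lsn >= commit_lsn)
        return replies >= self.quorum
```

The master only counts replies from whoever is connected, so it never has
to answer the question "are all the standbys here?".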

> >> What does synchronous replication mean, when is a transaction
> >> acknowledged as committed?
> >
> > I proposed four synchronization levels:
> >
> > 1. async
> >    doesn't make transaction commit wait for replication, i.e.,
> >    asynchronous replication. This mode has been already supported in
> >    9.0.
> >
> > 2. recv
> >    makes transaction commit wait until the standby has received WAL
> >    records.
> >
> > 3. fsync
> >    makes transaction commit wait until the standby has received and
> >    flushed WAL records to disk
> >
> > 4. replay
> >    makes transaction commit wait until the standby has replayed WAL
> >    records after receiving and flushing them to disk
> >
> > OTOH, Simon proposed the quorum commit feature. I think that both
> > are required for our various use cases. Thoughts?
> 
> I'd like to keep this as simple as possible, yet flexible so that with 
> enough scripting and extensions, you can get all sorts of behavior. I 
> think quorum commit falls into the "extension" category; if your setup 
> is complex enough, it's going to be impossible to represent that in our 
> config files no matter what. But if you write a little proxy, you can 
> implement arbitrary rules there.
> 
> I think recv/fsync/replay should be specified in the standby. 

I think the wait mode (i.e. recv/fsync/replay or others) should be
specified in the master. This allows the application to specify whatever
level of protection it requires, and also allows the behaviour to be
different for user-specifiable parts of the application. As soon as you
set this on the standby, you have a one-size-fits-all approach to
synchronisation.

We already know that the performance of synchronous replication is poor,
which is exactly why I want to be able to control it at the application
level. Fine-grained control is important; otherwise we may as well just
use DRBD and skip this project completely, since we already have that. It
will also be a feature that no other database has, taking us truly beyond
what has gone before.

The master/standby decision is not something that is easily changed.
Whichever we decide now will be the thing we stick with.

> It has no 
> direct effect on the master, the master would just relay the setting to 
> the standby when it connects, or the standby would send multiple 
> XLogRecPtrs and let the master decide when the WAL is persistent enough. 
> And what if you write a proxy that has some other meaning of "persistent 
> enough"? Like when it has been written to the OS buffers but not yet 
> fsync'd, or when it has been fsync'd to at least one standby and 
> received by at least three others. recv/fsync/replay is not going to 
> represent that behavior well.
> 
> "sync vs async" on the other hand should be specified in the master, 
> because it has a direct impact on the behavior of commits in the master.
> 
> I propose a configuration file standbys.conf, in the master:
> 
> # STANDBY NAME    SYNCHRONOUS   TIMEOUT
> importantreplica  yes           100ms
> tempcopy          no            10s
> 
> Or perhaps this should be stored in a system catalog.

That part sounds like complexity that can wait until later. I would not
object if you really want this, but would prefer it to look like this:

# STANDBY NAME    DEFAULT_WAIT_MODE   TIMEOUT
importantreplica  sync                100ms
tempcopy          async               10s
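
A file in this layout would be trivial to read. The following is only a
sketch in Python, assuming whitespace-separated fields, "#" comments and
the two timeout units shown above (ms and s); parse_standbys_conf and the
Standby tuple are hypothetical names, and the wait mode is deliberately
not validated because the set of modes is still under discussion.

```python
import re
from typing import NamedTuple

# Timeout units that appear in the example file; anything else is rejected.
_UNITS_IN_MS = {"ms": 1, "s": 1000}


class Standby(NamedTuple):
    name: str
    default_wait_mode: str
    timeout_ms: int


def parse_timeout(text):
    """Convert a value such as '100ms' or '10s' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s)", text)
    if match is None:
        raise ValueError(f"bad timeout: {text!r}")
    return int(match.group(1)) * _UNITS_IN_MS[match.group(2)]


def parse_standbys_conf(text):
    """Return a dict mapping standby name to its Standby entry."""
    standbys = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        name, mode, timeout = fields
        standbys[name] = Standby(name, mode, parse_timeout(timeout))
    return standbys
```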

You don't *have* to use the application level control if you don't want
it. But it's an important capability for real-world apps, since the
alternative is deliberately splitting an application across two database
servers, each with different wait modes.

> >> What to do if a standby server dies and never
> >> acknowledges a commit?
> >
> > The master's reaction to that situation should be configurable. So
> > I'd propose new configuration parameter specifying the reaction.
> > Valid values are:
> >
> > - standalone
> >    When the master has waited for the ACK much longer than the timeout
> >    (or detected the failure of the standby), it closes the connection
> >    to the standby and restarts transactions.
> >
> > - down
> >    When that situation occurs, the master shuts down immediately.
> >    Though this is unsafe for the system requiring high availability,
> >    as far as I recall, some people wanted this mode in the previous
> >    discussion.
> 
> Yeah, though of course you might want to set that per-standby too..
> 
> 
> Let's step back a bit and ask what would be the simplest thing that you 
> could call "synchronous replication" in good conscience, and also be 
> useful at least to some people. Let's leave out the "down" mode, because 
> that requires registration. We'll probably have to do registration at 
> some point, but let's take as small steps as possible.
> 
> Without the "down" mode in the master, frankly I don't see the point of 
> the "recv" and "fsync" levels in the standby. Either way, when the 
> master acknowledges a commit to the client, you don't know if it has 
> made it to the standby yet because the replication connection might be 
> down for some reason.
> 
> That leaves us the 'replay' mode, which *is* useful, because it gives 
> you the guarantee that when the master acknowledges a commit, it will 
> appear committed in all hot standby servers that are currently 
> connected. With that guarantee you can build a reliable cluster with 
> something like pgpool-II where all writes go to one node, and reads are 
> distributed to multiple nodes.
> 
> I'm not sure what we should aim for in the first phase. But if you want 
> as little code as possible yet have something useful, I think 'replay' 
> mode with no standby registration is the way to go.

I don't see it as any more code to implement.

When the standby replies, it can return
* latest LSN received
* latest LSN fsynced
* latest LSN replayed
etc

We then release waiting committers on the master according to which of
the above they said they want to wait for. The standby does *not* need
to know the wishes of transactions on the master.

Note that this means receiving, fsyncing and replaying can all progress
as an asynchronous pipeline, giving great overall throughput.

Once you accept that there are multiple modes, then the actual number of
wait modes is unimportant. It's just an array of [NUM_WAIT_MODES], so
the project need not be delayed just because we have 2, 3 or 4 wait
modes.
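
The release logic described above can be sketched roughly as follows, in
Python for brevity rather than backend C. WaitMode, CommitWaiters and the
method names are hypothetical, and a real implementation would wake
backends rather than return a list. The point is only that the standby
reports three LSNs and the master keeps one queue per wait mode.

```python
from enum import IntEnum


class WaitMode(IntEnum):
    RECV = 0
    FSYNC = 1
    REPLAY = 2


NUM_WAIT_MODES = len(WaitMode)


class CommitWaiters:
    """Master-side bookkeeping: one queue of waiting committers per mode."""

    def __init__(self):
        # queues[mode] holds (commit_lsn, backend_id) pairs.
        self.queues = [[] for _ in range(NUM_WAIT_MODES)]

    def wait(self, mode, commit_lsn, backend_id):
        self.queues[mode].append((commit_lsn, backend_id))

    def standby_reply(self, received, fsynced, replayed):
        """Release every waiter whose chosen LSN has now been reached."""
        reached = [0] * NUM_WAIT_MODES
        reached[WaitMode.RECV] = received
        reached[WaitMode.FSYNC] = fsynced
        reached[WaitMode.REPLAY] = replayed

        released = []
        for mode in WaitMode:
            still_waiting = []
            for commit_lsn, backend_id in self.queues[mode]:
                if commit_lsn <= reached[mode]:
                    released.append(backend_id)
                else:
                    still_waiting.append((commit_lsn, backend_id))
            self.queues[mode] = still_waiting
        return released
```

Adding a fourth mode would mean one more enum member and one more LSN in
the reply; the standby still knows nothing about what individual
transactions asked for.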

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


