Thread: Sync Rep at Oct 5

From: Simon Riggs
This is an attempt to compile everybody's stated viewpoints and come to
an understanding about where we are and where we want to go. The idea
from here is that we discuss what we are trying to achieve
(requirements) and then later come back to how (design).

To explain replication I use a simplified notation:
-> means "transferred to"
M means master
S means standby
e.g. M -> S describes the simple 2-node case of 1 master and 1 standby.

There are a number of "main" use cases, with many sub-cases.

= Main Use Cases =

== Simplest Case ==

Simplest case is the 2-node case: M -> S
For sync rep, we want to be able to request that M's changes are
transferred safely to S. 

This case suffers from the problem that if S is unavailable we are not
sure whether to wait or not. If we wait, we might wait forever, so we
need some mechanism to prevent waiting forever. A variety of mechanisms
would be desirable:
* timeout defined prior to start of wait
* explicit "end waiting" message sent to one/all backends
* ability to specify "don't wait" if standby is currently absent
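These mechanisms can be sketched as a small wait object (a hypothetical
illustration in Python, not the actual backend code): a backend waits on
an event with an optional timeout, and an explicit release covers both
the "end waiting" message and the "don't wait" case (released before the
wait starts).

```python
import threading

class SyncRepWait:
    """Hypothetical sketch of a backend's sync rep wait (not PostgreSQL code)."""
    def __init__(self):
        self._released = threading.Event()

    def release(self):
        # Explicit "end waiting" message sent to this backend,
        # or "don't wait" if called before the wait begins.
        self._released.set()

    def wait_for_ack(self, timeout=None):
        # Wait for the standby's acknowledgement; a timeout defined
        # prior to the start of the wait prevents waiting forever.
        # Returns True if released (acked), False if we gave up.
        return self._released.wait(timeout)
```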

This problem is particularly clear when the server starts. In current
streaming the standby can only connect when the master is active, so there
may be a sequencing delay between master start and standby start. One
possible solution here is to prevent connection until required
robustness levels are available, rather than forcing master transactions
to wait until standbys connect and can ack receipt of data.

== Maximum Availability ==

A common architecture for avoiding the problems of waiting is to have
multiple standbys. We allow for the possibility that one of the standbys
is unavailable, by allowing sync rep against just one of the standbys,
rather than both of them.

M -> (S1, S2)

Thus, if S1 is down we still have M -> S2, or if S2 is down we still
have M -> S1.

This gives N+1 redundancy. 

Some people may wish to have N+k redundancy, like this M -> (S1, ...,
Sk), since the possibility of cluster unavailability increases
geometrically with large N.

(Would somebody like to work out the probability of cluster outage
capital P, when each server has identical probability of individual
outage, small p, and we have N nodes and k spares? ..Binomial series).
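Taking up that invitation: if failures are independent, each with
probability p, and the cluster is out once more than k of the N servers
are down simultaneously, then P is a binomial tail. A quick sketch:

```python
from math import comb

def cluster_outage_probability(n, k, p):
    """P(cluster outage): more than k of n servers down at once,
    assuming independent failures each with probability p
    (upper tail of the binomial distribution)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k + 1, n + 1))
```

For example, with N = 3, k = 1 and p = 0.01 this gives about 3.0e-4,
versus 1e-2 for a single server.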

== Quorum Commit ==

Some people have also proposed that they would like to hear back from
more than one node, but not necessarily all standby nodes before
allowing commit, which has led to the phrase "quorum commit". Again,
this seems a likely requirement for large N.

Waiting for more than one response complicates the shmem data structures
since we have to book-keep state of each standby and follow a more
complex algorithm before releasing standbys from their wait.
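The bookkeeping could be as simple as tracking the last LSN each standby
has confirmed and counting how many have passed the commit point (an
illustrative sketch; the names are invented, not shmem code):

```python
def quorum_satisfied(commit_lsn, standby_acks, quorum):
    """standby_acks maps standby name -> last LSN it has confirmed.
    The waiting backend is released once at least `quorum` standbys
    have confirmed an LSN at or past the commit record."""
    confirmed = sum(1 for lsn in standby_acks.values() if lsn >= commit_lsn)
    return confirmed >= quorum
```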

== Load Balanced Server Farm ==

Some people have requested that we have a mode that requires
confirmation of sync rep against *all* standbys before allowing commit.

Even with this, data modifications would become visible on the standbys
at different times, so this configuration is still only "eventually
consistent", as are the other sync rep use cases. This config would get
geometrically less robust as new servers are added.

If we require waiting for "All" servers we would need to keep a
registered list of servers that ought to be connected, allowing us to
wait when one of those servers is unavailable. Automatic and explicit
registration have been proposed.

= Common Sub-Cases/Requirements =

== Standby Registration ==

Some people prefer an explicitly defined cluster config, with additional
specifications of what happens if some parts go down. This section is a
record of the various methods suggested.

Other people prefer less than that.

There are a few possibilities:
* path config: we specify what happens on each path to a specific standby
* standby config: we specify what a standby can do, but not the specific
paths to that standby
* no standby-level config: we don't specify anything on a per-standby
basis.

We can also specify registration as
* explicit registration
* automatic registration - some servers register automatically
* no registration records kept at all

== Failover Configuration Minimisation ==

An important aspect of robustness is the ability to specify a
configuration that will remain in place even though 1 or more servers
have gone down.

It is desirable to specify sync rep requirements such that we do not
refer to individual servers, if possible. Each such rule necessarily
requires an "else" condition, possibly multiple else conditions.

It is desirable to avoid both of these:
* the need to have different configuration files on each node
* the need to have configurations that only become active in case of
failure. These are known to be hard to test and very likely to be
misconfigured in the event of failover [I know a bank that was down for
a whole week when standby server's config was wrong and had never been
fully tested. The error was simple and obvious, but the fault showed
itself as a sporadic error that was difficult to trace]

== Sync Rep Performance ==

Sync Rep is a potential performance hit, and that hit is known to
increase as geographical distance increases.

We want to be able to trade performance against robustness on a
per-node basis, giving 4 levels of robustness:
async - doesn't wait for sync
recv - syncs when messages are received by the standby
fsync - syncs when messages are written to disk by the standby
apply - syncs when messages are applied by the standby
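Since the four levels form a strict ordering, checking whether a
standby's acknowledgement satisfies a requested level reduces to an
index comparison (a sketch with invented names, not the actual
parameter handling):

```python
LEVELS = ["async", "recv", "fsync", "apply"]  # weakest to strongest

def level_satisfied(required, acked):
    """True if a standby that has reached state `acked` satisfies a
    commit waiting for `required`. 'async' (index 0) never waits,
    so any state satisfies it."""
    return LEVELS.index(acked) >= LEVELS.index(required)
```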

== Transaction Controlled Robustness ==

It's desirable to be able to specify the robustness levels (in crude
form, which remote servers need to confirm before we end waiting). This
ability is an important tuning feature for sync replication.

Some people have mentioned simply a "level", so that once that level is
achieved by one server, the sync is considered sufficient.

Others have said they want to confirm that their preferred failover
target has received the replicated data. [I am unsure as yet why
specifying an exact failover target is desirable, since it complicates
configuration when that exact server is down]

Others have proposed some combination of levels and specific servers
would provide sufficient sync, or some combination based upon weighted
values of different servers.

It has also been suggested that any level of complexity beyond the very
simple can be hidden within a "replication service", so there is
indirection between the application and the sync rep configuration.


== Path Minimization ==

We want to be able to minimize and control the path of data transfer,
* so that the current master doesn't have to initiate transfer to all
dependent nodes, thereby reducing overhead on the master
* so that if the path from the current master to a descendant is
expensive we can minimize network costs.

This requirement is commonly known as "relaying". 

In its most simply stated form, we want one standby to be able to get
WAL data from another standby. e.g. M -> S -> S. Stating the problem in
that way misses out on the actual requirement, since people would like
the arrangement to be robust in case of failures of M or any S. If we
specify the exact arrangement of paths then we need to respecify the
arrangement of paths if a server goes down.

For sync rep, we may have a situation where the master sends data to a
group of remote nodes. We would like to specify that the transfer is
synchronous to *one* of S1, S2 or S3, depending upon which is directly
connected to the master at the present moment. After that, data is
transferred
locally to other nodes by relay. If S1 is down, then S2 would take the
place of S1 seamlessly and would feed S1 if it comes back up. The master
does not wish to specify which of (S1, S2, S3) it syncs with - as long
as *one* of them confirms it.

So from the master's perspective we have M -> {S1, S2, S3}
though in practice the physical arrangement will be M -> S1 -> (S2, S3)
unless S1 is down, in which case the arrangement would be
M -> S2 -> S3 or M -> S2 -> (S3, S1) if S1 comes back up.

Note the curly brackets to denote a difference between this case and the
M -> (S1, S2, S3) arrangement where all 3 paths are active.
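The seamless takeover described above could be driven by a rule as
simple as "the first available standby in the list relays to the rest"
(an illustrative sketch, not a proposed implementation):

```python
def relay_arrangement(standbys, up):
    """standbys: preference-ordered list, e.g. ["S1", "S2", "S3"].
    up: mapping standby -> availability.
    Returns (head, rest): the master syncs with `head`, which relays
    to every other standby (including down ones, which re-attach to
    the current head when they come back up)."""
    head = next((s for s in standbys if up.get(s)), None)
    if head is None:
        return None, []
    return head, [s for s in standbys if s != head]
```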

== Lower Priority System Access ==

e.g. a Development or Test database server

The requirement here is for a lower priority server to be attached to
the replication cluster. It receives data normally from the master, but
we would like to explicitly specify that it is not part of the high
availability framework and will never be a failover target.

So we need to be able to specify either or both of:
* explicitly ignore this server
* implicitly ignore this server because it is not on the list of
explicitly included servers

So we must describe config at the server level, somewhere. This
requirement differs from the ability to specify robustness level on the
master.


There may be mistakes or omissions in the above. I've mentioned a few
places where I note the request but don't yet understand the argument as
to why that aspect is desirable.

Please add any aspects I've missed. I'll dump the edited version to the
Wiki when we've had some feedback on this.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep at Oct 5

From: Steve Singer
On 10-10-05 04:32 AM, Simon Riggs wrote:
>
> This is an attempt to compile everybody's stated viewpoints and come to
> an understanding about where we are and where we want to go. The idea
> from here is that we discuss what we are trying to achieve
> (requirements) and then later come back to how (design).

Great start on summarizing the discussions.  Getting a summary of the 
requirements in one place will help people who haven't been diligent in 
following all the sync-rep email threads stay involved.

<snip>

> == Failover Configuration Minimisation ==
>
> An important aspect of robustness is the ability to specify a
> configuration that will remain in place even though 1 or more servers
> have gone down.
>
> It is desirable to specify sync rep requirements such that we do not
> refer to individual servers, if possible. Each such rule necessarily
> requires an "else" condition, possibly multiple else conditions.
>
> It is desirable to avoid both of these
> * the need to have different configuration files on each node
> * the need to have configurations that only become active in case of
> failure. These are known to be hard to test and very likely to be
> misconfigured in the event of failover [I know a bank that was down for
> a whole week when standby server's config was wrong and had never been
> fully tested. The error was simple and obvious, but the fault showed
> itself as a sporadic error that was difficult to trace]
>

Also on the topic of failover, how do we want to deal with the master 
failing over?  Say M->{S1,S2} and M fails and we promote S1 to M1.  Can 
M1->S2?  What if S2 was further along in processing than S1 when M 
failed?  I don't think we want to take on this complexity for 9.1, but 
this means that after M fails you won't have a synchronous replica until 
you rebuild or somehow reset S2.




> == Sync Rep Performance ==
>
> Sync Rep is a potential performance hit, and that hit is known to
> increase as geographical distance increases.
>
> We want to be able to specify the performance of some nodes so that we
> have 4 levels of robustness:
> async - doesn't wait for sync
> recv - syncs when messages received by standby
> fsync - syncs when messages written to disk by standby
> apply - sync when messages applied to standby

Will read-only queries running on a slave hold up transactions from 
being applied on that slave?   I suspect that for most people running 
with 'apply' they would want the answer to be 'no'.  Are we going to 
revisit the standby query cancellation discussion?



> == Path Minimization ==
>
> We want to be able to minimize and control the path of data transfer,
> * so that the current master doesn't have initiate transfer to all
> dependent nodes, thereby reducing overhead on master
> * so that if the path from current master to descendent is expensive we
> would minimize network costs.
>
> This requirement is commonly known as "relaying".
>
> In its most simply stated form, we want one standby to be able to get
> WAL data from another standby. e.g. M ->  S ->  S. Stating the problem in
> that way misses out on the actual requirement, since people would like
> the arrangement to be robust in case of failures of M or any S. If we
> specify the exact arrangement of paths then we need to respecify the
> arrangement of paths if a server goes down.

Are we going to allow these paths to be reconfigured on a live cluster? 
If we have M->S1->S2 and we want to reconfigure S2 to read from M then 
S2 needs to get the data that has already been committed on S1 from 
somewhere (either S1 or M).  This has solutions but it adds to the 
complexity.  Maybe not for 9.1






Re: Sync Rep at Oct 5

From: Simon Riggs
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

> Will read-only queries running on a slave hold up transactions from 
> being applied on that slave?   I suspect that for most people running 
> with 'apply' they would want the answer to be 'no'.  Are we going to 
> revisit the standby query cancellation discussion?

Once we have a feedback channel from standby to master it's a simple
matter to add some feedback to avoid many query cancellations.

That was the original plan for 9.0 but we changed from sync rep to
streaming rep so late in the cycle that there was no time to do it that
way.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services



Re: Sync Rep at Oct 5

From: Simon Riggs
On Tue, 2010-10-05 at 11:30 -0400, Steve Singer wrote:

> Also on the topic of failover how do we want to deal with the master 
> failing over.   Say M->{S1,S2} and M fails and we promote S1 to M1.  Can 
> M1->S2?     What if S2 was further along in processing than S1 when M 
> failed?  I don't think we want to take on this complexity for 9.1 but 
> this means that after M fails you won't have a synchronous replica until 
> you rebuild or somehow reset S2.

Those are problems that can be resolved, but that is the current state.
The trick, I guess, is to promote the correct standby.

Those are generic issues, not related to any specific patch. Thanks for
keeping those issues in the limelight.

> > == Path Minimization ==
> >
> > We want to be able to minimize and control the path of data transfer,
> > * so that the current master doesn't have initiate transfer to all
> > dependent nodes, thereby reducing overhead on master
> > * so that if the path from current master to descendent is expensive we
> > would minimize network costs.
> >
> > This requirement is commonly known as "relaying".
> >
> > In its most simply stated form, we want one standby to be able to get
> > WAL data from another standby. e.g. M ->  S ->  S. Stating the problem in
> > that way misses out on the actual requirement, since people would like
> > the arrangement to be robust in case of failures of M or any S. If we
> > specify the exact arrangement of paths then we need to respecify the
> > arrangement of paths if a server goes down.
> 
> Are we going to allow these paths to be reconfigured on a live cluster? 
> If we have M->S1->S2 and we want to reconfigure S2 to read from M then 
> S2 needs to get the data that has already been committed on S1 from 
> somewhere (either S1 or M).  This has solutions but it adds to the 
> complexity.  Maybe not for 9.1

If you switch from M -> S1 -> S2 to M -> (S1, S2) it should work fine.
At the moment that needs a shutdown/restart, but that could easily be
done with a disconnect/reconnect following a file reload.

The problem is how much WAL is stored on (any) node. Currently that is
wal_keep_segments, which doesn't work very well, but I've seen no better
ideas that cover all important cases.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services




Re: Sync Rep at Oct 5

From: Fujii Masao
On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The problem is how much WAL is stored on (any) node. Currently that is
> wal_keep_segments, which doesn't work very well, but I've seen no better
> ideas that cover all important cases.

What about allowing the master to read and send WAL from the archive?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Sync Rep at Oct 5

From: Steve Singer
On 10-10-07 05:52 AM, Fujii Masao wrote:
> On Wed, Oct 6, 2010 at 4:06 PM, Simon Riggs<simon@2ndquadrant.com>  wrote:
>> The problem is how much WAL is stored on (any) node. Currently that is
>> wal_keep_segments, which doesn't work very well, but I've seen no better
>> ideas that cover all important cases.
>
> What about allowing the master to read and send WAL from the archive?
>
> Regards,

Then you have to deal with telling the archive how long it needs to keep 
WAL segments because the master might ask for them back.  If the archive 
is remote from the master then you have some extra network copying going 
on.  It would be better to let the slave being reconfigured read the 
missing WAL from the archive.





Re: Sync Rep at Oct 5

From: Fujii Masao
On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
> Then you have to deal with telling the archive how long it needs to keep WAL
> segments because the master might ask for them back.

Yeah, it's not easy to determine how long we should keep the archived WAL files.
We need to calculate which WAL file is deletable according to the progress of
each standby and the state of the stored base backups which still might be used
for PITR.

> If the archive is
> remote from the master then you have some extra network copying going on.

Yep, so I think that the master (i.e., walsender) should read the archived
WAL file by using the restore_command specified by users. If the archive is
remote from the master, then we would need to specify something like scp in
restore_command. Also, even if you compress the archived WAL file by using
pg_compress, the master can decompress it by using pg_decompress in
restore_command and transfer it.
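For the remote-archive case, such a restore_command might look like this
(hostname and paths are purely illustrative; %f and %p are the standard
file-name and destination-path placeholders):

```
# Illustrative only: fetch archived WAL from a remote host over scp
restore_command = 'scp archivehost:/path/to/archive/%f "%p"'
```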

> It would be better to let the slave being reconfigured to read the missing
> WAL from the archive.

That's one of the choices.

But I've heard that some people don't want to set up the shared archive area
which can be accessed by the master and the standby. For example, they feel
that it's complex to configure NFS server or automatic-scp-without-password
setting for sharing the archived WAL files.

Currently we have to increase wal_keep_segments to work around that problem.
But the pg_xlog disk space is usually small and not suitable to keep many
WAL files. So we might be unable to increase wal_keep_segments.

If we allow the master to stream WAL files from the archive, we don't need
to increase wal_keep_segments or set up such a complex configuration. So
this idea is one of the useful choices, I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Sync Rep at Oct 5

From: Robert Haas
On Thu, Oct 7, 2010 at 9:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 7, 2010 at 8:46 PM, Steve Singer <ssinger@ca.afilias.info> wrote:
>> Then you have to deal with telling the archive how long it needs to keep WAL
>> segments because the master might ask for them back.
>
> Yeah, it's not easy to determine how long we should keep the archived WAL files.
> We need to calculate which WAL file is deletable according to the progress of
> each standby and the state of the stored base backups which still might be used
> for PITR.
>
>> If the archive is
>> remote from the master then you have some extra network copying going on.
>
> Yep, so I think that the master (i.e., walsender) should read the archived
> WAL file by using restore_command specified by users. If the archive is
> remote from the master, then we would need to specify something like scp in
> restore_command. Also, even if you compress the archived WAL file by using
> pg_compress, the master can decompress it by using pg_decompress in
> restore_command and transfer it.
>
>> It would be better to let the slave being reconfigured to read the missing
>> WAL from the archive.
>
> That's one of choices.
>
> But I've heard that some people don't want to set up the shared archive area
> which can be accessed by the master and the standby. For example, they feel
> that it's complex to configure NFS server or automatic-scp-without-password
> setting for sharing the archived WAL files.
>
> Currently we have to increase wal_keep_segments to work around that problem.
> But the pg_xlog disk space is usually small and not suitable to keep many
> WAL files. So we might be unable to increase wal_keep_segments.
>
> If we allow the master to stream WAL files from the archive, we don't need
> to increase wal_keep_segments and set up such a complex configuration. So
> this idea is one of useful choices, I think.

I'm not sure anyone other than yourself has endorsed this idea, but in
any case it seems off the critical path for getting this feature
committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company