Sync Rep at Oct 5 - Mailing list pgsql-hackers

From Simon Riggs
Subject Sync Rep at Oct 5
Date
Msg-id 1286267566.2025.901.camel@ebony
Whole thread Raw
Responses Re: Sync Rep at Oct 5
List pgsql-hackers
This is an attempt to compile everybody's stated viewpoints and come to
an understanding about where we are and where we want to go. The idea
from here is that we discuss what we are trying to achieve
(requirements) and then later come back to how (design).

To explain replication I use a simplified notation:
-> means "transferred to"
M means master
S means standby
e.g. M -> S describes the simply 2 node case of 1 master and 1 standby

There are a number of "main" use cases, with many sub-cases.

= Main Use Cases =

== Simplest Case ==

Simplest case is the 2-node case: M -> S
For sync rep, we want to be able to request that M's changes are
transferred safely to S. 

This case suffers from the problem that if S is unavailable we are not
sure whether to wait, or not. If we wait, we might wait forever, so we
need some mechanisms to prevent waiting forever. Variety of mechanisms
would be desirable
* timeout defined prior to start of wait
* explicit "end waiting" message sent to one/all backends
* ability to specify "don't wait" if standby is currently absent

This problem is particularly clear when the server starts. In current
streaming the standby can only connect when master is active, so there
may be a sequencing delay between master start and standby start. One
possible solution here is to prevent connection until required
robustness levels are available, rather than forcing master transactions
to wait until standbys connect and can ack receipt of data.

== Maximum Availability ==

A common architecture for avoiding the problems of waiting is to have
multiple standbys. We allow for the possibility that one of the standbys
is unavailable, by allowing sync rep against just one of the standbys,
rather than both of them.

M -> (S1, S2)

Thus, if S1 is down we still have M -> S2, or if S2 is down we still
have M -> S1.

This gives N+1 redundancy. 

Some people may wish to have N+k redundancy, like this M -> (S1, ...,
Sk), since the possibility of cluster unavailability increases
geometrically with large N.

(Would somebody like to work out the probability of cluster outage
capital P, when each server has identical probability of individual
outage, small p, and we have N nodes and k spares? ..Binomial series).

== Quorum Commit ==

Some people have also proposed that they would like to hear back from
more than one node, but not necessarily all standby nodes before
allowing commit, which has led to the phrase "quorum commit". Again,
this seems a likely requirement for large N.

Waiting for more than one response complicates the shmem data structures
since we have to book-keep state of each standby and follow a more
complex algorithm before releasing standbys from their wait.

== Load Balanced Server Farm ==

Some people have requested that we have a mode that requires
confirmation of sync rep against *all* standbys before allowing commit.

Even with this, data modifications would become visible on the standbys
at different times, so this configuration is still only "eventually
consistent", as are the other sync rep use cases. This config would get
geometrically less robust as new servers are added.

If we require waiting for "All" servers we would need to keep a
registered list of servers that ought to be connected, allowing us to
wait when one of those servers is unavailable. Automatic and explicit
registration have been proposed.

= Common Sub-Cases/Requirements =

== Standby Registration ==

Some people prefer an explicitly defined cluster config, with additional
specifications of what happens if some parts go down. This section is a
record of the various methods suggested

Other people prefer less than that.

There are a few possibilities:
* path config: we specify what happens standby to specific standby
* standby config: we specify what a standby can do, but not the specific
paths to that standby
* no standby-level config: we don't specify what happens on a
per-standby config.

We can also specify registration as
* explicit registration
* automatic registration - some servers register automatically
* no registration records kept at all

== Failover Configuration Minimisation ==

An important aspect of robustness is the ability to specify a
configuration that will remain in place even though 1 or more servers
have gone down.

It is desirable to specify sync rep requirements such that we do not
refer to individual servers, if possible. Each such rule necessarily
requires an "else" condition, possibly multiple else conditions.

It is desirable to avoid both of these
* the need to have different configuration files on each node
* the need to have configurations that only become active in case of
failure. These are known to be hard to test and very likely to be
misconfigured in the event of failover [I know a bank that was down for
a whole week when standby server's config was wrong and had never been
fully tested. The error was simple and obvious, but the fault showed
itself as a sporadic error that was difficult to trace]

== Sync Rep Performance ==

Sync Rep is a potential performance hit, and that hit is known to
increase as geographical distance increases.

We want to be able to specify the performance of some nodes so that we
have 4 levels of robustness:
async - doesn't wait for sync
recv - syncs when messages received by standby
fsync - syncs when messages written to disk by standby
apply - sync when messages applied to standby

== Transaction Controlled Robustness ==

It's desirable to be able to specify the robustness levels (in crude
form, which remote servers need to confirm before we end waiting). This
ability is an important tuning feature for sync replication.

Some people have mentioned simply a "level", so that once that level is
achieved by one server that is sufficient sync.

Others have said they want to confirm that their preferred failover
target has received the replicated data. [I am unsure as yet why
specifying an exact failover target is desirable, since it complicates
configuration when that exact server is down]

Others have proposed some combination of levels and specific servers
would provide sufficient sync, or some combination based upon weighted
values of different servers.

It has also been suggested that any level of complexity beyond the very
simple can be hidden within a "replication service", so there is
indirection between the application and the sync rep configuration.


== Path Minimization ==

We want to be able to minimize and control the path of data transfer,
* so that the current master doesn't have initiate transfer to all
dependent nodes, thereby reducing overhead on master
* so that if the path from current master to descendent is expensive we
would minimize network costs.

This requirement is commonly known as "relaying". 

In its most simply stated form, we want one standby to be able to get
WAL data from another standby. e.g. M -> S -> S. Stating the problem in
that way misses out on the actual requirement, since people would like
the arrangement to be robust in case of failures of M or any S. If we
specify the exact arrangement of paths then we need to respecify the
arrangement of paths if a server goes down.

For sync rep, we may have a situation where the master sends data to a
group of remote nodes. We would like to specify that the transfer is
synchronous to *one* of S1, S2 or S3, depending upon which is directly
connected to master at present moment. After that, data is transferred
locally to other nodes by relay. If S1 is down, then S2 would take the
place of S1 seamlessly and would feed S1 if it comes back up. The master
does not wish to specify which of (S1, S2, S3) it syncs with - as long
as *one* of them confirms it.

So from master perspective we have M -> {S1, S2, S3}
though in practice the physical arrangement will be M -> S1 -> (S2, S3)
unless S1 is down, in which case the arrangement would be
M -> S2 -> S3 or M -> S2 -> (S3, S1) if S1 comes back up.

Note the curly brackets to denote a difference between this case and the
M -> (S1, S2, S3) arrangement where all 3 paths are active.

== Lower Priority System Access ==

e.g. a Development or Test database server

The requirement here is for a lower priority server to be attached to
the replication cluster. It receives data normally from the master, but
we would like to explicitly specify that it is not part of the high
availability framework and will never be a failover target.

So we need to be able to specify either/all
* explicitly ignore this server
* implicitly ignore this server because it is not on the list of
explicitly included servers

So we must describe config at the server level, somewhere. This
requirement differs from the ability to specify robustness level on the
master.


There may be mistakes or omissions in the above. I've mentioned a few
places where I note the request but don't yet understand the argument as
to why that aspect is desirable.

Please add any aspects I've missed. I'll dump the edited version to the
Wiki when we've had some feedback on this.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services



pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: pg_filedump for 9.0?
Next
From: Dean Rasheed
Date:
Subject: Re: wip: functions median and percentile