Sync Replication with transaction-controlled durability

I'm working on a patch to implement synchronous replication for
PostgreSQL, with user-controlled durability specified on the master. The
design also provides high throughput by allowing concurrent processes to
handle the WAL stream. The proposal requires only 3 new parameters and
takes into account much community feedback on earlier ideas.

The patch is fairly simple, though reworking it to sit on top of latches
means I haven't quite finished it yet. I was asked to explain the design
as soon as possible, so I am posting this ahead of the patch.

= WHAT'S DIFFERENT ABOUT THIS PATCH?

* Low complexity of code on the standby
* User control: all decisions to wait take place on the master, allowing
fine-grained control of synchronous replication. The maximum replication
level offered can also be set on the standby.
* Low bandwidth: very small reply packets, and no increase in the number
of replies when the system is under high load, mean very little
additional bandwidth is required.
* Performance: standby processes work concurrently to give good overall
throughput on the standby and minimal latency in all modes. The four
durability options don't interfere with each other, so different levels
of performance/durability can be offered alongside each other.

These are major wins for the PostgreSQL project over and above the basic
sync rep feature.

= SYNCHRONOUS REPLICATION OVERVIEW

Synchronous replication offers the guarantee that all changes made by a
transaction have been transferred to remote standby nodes. This is an
extension to the standard level of durability offered by a transaction
commit.

When synchronous replication is requested the transaction will wait
after it commits until it receives confirmation that the transfer has
been successful. Waiting for confirmation increases the user's certainty
that the transfer has taken place but it also necessarily increases the
response time for the requesting transaction. Synchronous replication
usually requires carefully planned and placed standby servers to ensure
applications perform acceptably. Waiting doesn't utilise system
resources, but transaction locks continue to be held until the transfer
is confirmed. As a result, incautious use of synchronous replication
will lead to reduced performance for database applications.

It may seem that there is a simple choice between durability and
performance. However, there is often a close relationship between the
importance of data and how busy the database needs to be, so this is
seldom a simple choice. With this patch, PostgreSQL now provides a range
of features designed to allow application architects to design a system
that has both good overall performance and yet good durability of the
most important data assets.

PostgreSQL allows the application designer to specify the durability
level required via replication. This can be specified for the system
overall, or for individual transactions. This makes it possible to
provide the highest levels of protection selectively, for the most
critical data.

For example, an application might consist of two types of work:
* 10% of changes are to important customer details
* 90% of changes are to less important data that the business could more
easily survive losing, such as chat messages between users.

With sync replication options specified at the application level (on the
master) we can offer sync rep for the most important changes, without
slowing down the bulk of the total workload. Application-level options
are an important and practical tool for bringing the benefits of
synchronous replication to high-performance applications.

Without sync rep options at the application level, we would have to
choose between slowing down 90% of the workload because 10% of it is
important, giving up our durability goals for the sake of performance,
or splitting the two functions onto separate database servers so that we
can set options differently on each. None of those three options is
truly attractive.
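
To make the application-level approach concrete, here is a minimal libpq
sketch of the two workloads above. It assumes the proposed
synchronous_replication parameter behaves like any other USERSET setting,
so SET LOCAL scopes it to a single transaction; the table and column
names are purely hypothetical.

    /* Illustrative only: assumes the proposed synchronous_replication
     * GUC exists on the server; tables/columns are hypothetical. */
    #include <stdio.h>
    #include <libpq-fe.h>

    static void run(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);

        /* Minimal error reporting only, to keep the sketch short. */
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
    }

    int main(void)
    {
        PGconn *conn = PQconnectdb("");     /* connection settings from environment */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Important change: request sync rep for this transaction only,
         * so its COMMIT waits for the standby. */
        run(conn, "BEGIN");
        run(conn, "SET LOCAL synchronous_replication = 'fsync'");
        run(conn, "UPDATE customer SET email = 'new@example.com' WHERE id = 42");
        run(conn, "COMMIT");

        /* Bulk change: leave the default (async), so COMMIT does not wait. */
        run(conn, "INSERT INTO chat_message (body) VALUES ('hello')");

        PQfinish(conn);
        return 0;
    }

Only the first COMMIT waits for standby confirmation; the bulk workload
commits exactly as before.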

PostgreSQL also gives the system administrator the ability to specify
the service levels offered by standby servers. This allows multiple
standby servers to work together in various roles within a server farm.

Control of this feature relies on just 3 parameters:
On the master we can set

* synchronous_replication
* synchronous_replication_timeout

On the standby we can set

* synchronous_replication_service

These are explained in more detail in the following sections.

= USER'S OVERVIEW

Two new USERSET parameters on the master control this:
* synchronous_replication = async (default) | recv | fsync | apply
* synchronous_replication_timeout = 0+ (0 means never time out;
default 10s)
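
For example, the cluster-wide defaults might be set in postgresql.conf
as follows (a sketch using the proposed parameters; the timeout is
assumed to accept the usual GUC time units):

    synchronous_replication = async            # USERSET, so sessions/transactions may override
    synchronous_replication_timeout = 10s      # 0 means never time out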

synchronous_replication = async is the default and means that no
synchronisation is requested, so the commit will not wait. This is the
fastest setting. The word async is short for "asynchronous" and you may
see the term asynchronous replication discussed.

Other settings refer to progressively higher levels of durability. The
higher the level of durability requested, the longer the wait for that
level of durability to be achieved.

The precise meaning of the synchronous_replication settings is
* async  - commit does not wait for a standby before replying to user
* recv - commit waits until standby has received WAL
* fsync - commit waits until standby has received and fsynced WAL
* apply - commit waits until standby has received, fsynced and applied
This provides a simple, easily understood mechanism - and one that in
its default form is very similar to that of other RDBMSs (e.g. Oracle).

Note that in apply mode it is possible for the changes to be visible to
queries on the standby before the transaction that made the change has
been notified that the commit is complete. This is a minor issue.

Network delays may occur and the standby may also crash. If no reply is
received within the timeout we raise a NOTICE and then return a
successful commit (no other action is possible). Note that it is
possible to request that we never time out, so if no standby is
available we wait for one to appear.

When a user commits, if the master does not have a currently connected
standby offering the required level of replication it will pick the next
best available level of replication. It is up to the sysadmin to provide
a sufficient range of standby nodes to ensure at least one is available
to meet the requested service levels.

If multiple standbys exist, the first standby to reply that the desired
level of durability has been achieved will release the waiting commit on
the master. Other policies would also be possible via a plugin.

= ADMINISTRATOR'S OVERVIEW

On the standby we specify the highest type of replication service
offered by this standby server. This information is passed to the master
server when the standby connects for replication.

This allows sysadmins to designate preferred standbys. It also allows
sysadmins to completely refuse to offer a synchronous replication
service, allowing a master to explicitly avoid synchronisation across
low bandwidth or high latency links.

An additional parameter can be set in recovery.conf on the standby:

* synchronous_replication_service = async (default) | recv | fsync | apply
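
For illustration, a standby prepared to acknowledge up to fsync level
might use a recovery.conf along these lines (the connection string is a
placeholder; only the last line is new in this proposal):

    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432 user=replicator'
    synchronous_replication_service = 'fsync'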


= IMPLEMENTATION

Some aspects can be changed without significantly altering the basic
proposal; for example, master-specified standby registration wouldn't
really alter this very much.

== STANDBY

Master-controlled sync rep means that all user wait logic is centred on
the master. The details of sync rep requests on the master are not sent
to the standby, so there is no additional master to standby traffic nor
standby-side bookkeeping overheads. It also reduces complexity of
standby code.

On the standby side the WALWriter now operates during recovery. This
frees the WALReceiver to spend more time sending and receiving messages,
thereby minimising latency for users choosing the "recv" option. We now
have three processes handling WAL in an asynchronous pipeline: the
WALReceiver reads WAL data from the libpq connection and writes it to
the WAL file, the WALWriter then fsyncs the WAL file, and the Startup
process replays the WAL. These processes act independently, so their WAL
pointers (LSNs) always satisfy WALReceiverLSN >= WALWriterLSN >=
StartupLSN.

For each new message the WALReceiver gets from the master we issue a
reply. Each reply sends the current state of the three LSNs, so the
reply message size is only 28 bytes. Replies are sent half-duplex, i.e.
we don't reply while a new message is arriving.

Note that there is absolutely not one reply per transaction on the
master. The standby knows nothing about what has been requested on the
master - replies always refer to the latest standby state and
effectively batch the responses.
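
To make that concrete, a sketch of the reply payload is given below.
This is purely illustrative: the names are hypothetical, plain 64-bit
integers stand in for the server's LSN type, and the extra framing that
brings the message up to the stated 28 bytes is not modelled.

    #include <stdint.h>

    /*
     * Hypothetical payload of a standby-to-master reply: the three LSNs
     * described above. Each reply reports the standby's latest state, so
     * a single reply can acknowledge many waiting transactions on the
     * master. Invariant: receive_lsn >= fsync_lsn >= apply_lsn.
     */
    typedef struct StandbyReply
    {
        uint64_t    receive_lsn;   /* latest WAL received and written by the WALReceiver */
        uint64_t    fsync_lsn;     /* latest WAL fsynced by the WALWriter */
        uint64_t    apply_lsn;     /* latest WAL replayed by the Startup process */
    } StandbyReply;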

We act according to the requested synchronous_replication_service:
* async -  no replies are sent
* recv -   replies are sent upon receipt only
* fsync -  replies are sent upon receipt and following fsync only
* apply -  replies are sent following receipt, fsync and apply.

Replies are sent at the next available opportunity.

In apply mode, when the WALReceiver is completely quiet, this means we
send three reply messages: one at recv, one at fsync and one at apply.
When the WALReceiver is busy the volume of messages does *not* increase,
since a reply can't be sent until the current incoming message has been
received, after which we were going to reply anyway, so it is not an
additional message. In effect we piggyback an "apply" response onto a
later "recv" reply. As a result we get minimum response times in *all*
modes and maximum throughput is not impaired at all.

When each new message arrives from the master, the WALReceiver writes
the new data to the WAL file, wakes the WALWriter and then replies. Each
new message from the master receives a reply. If no further WAL data has
been received, the WALReceiver waits on its latch. If the WALReceiver is
woken by the WALWriter or the Startup process then it replies to the
master, even if no new WAL has been received.

So in the recv, fsync and apply cases alike, a reply is sent to the
master as soon as possible, so that in all cases the wait time is
minimised.

When the WALWriter is woken it checks whether there is outstanding WAL
data; if so it fsyncs it and wakes both the WALReceiver and the Startup
process. When no WAL remains it waits on its latch.

The Startup process wakes the WALReceiver when it has reached the end of
the latest chunk of WAL. If no further WAL is available it waits on its
latch.

== MASTER

When user backends request sync rep they wait in a queue ordered by
requested LSN. A separate queue exists for each request mode.

WALSender receives the 3 LSNs from the standby. It then wakes backends
in sequence from each queue.

We provide a single wakeup rule: the first WALSender to report that the
requested XLogRecPtr has been reached will wake the backend. This
guarantees that the WAL data for the commit has been transferred, as
requested, to at least one standby. That is sufficient for the use cases
we have discussed.
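
A minimal sketch of that wakeup rule is shown below, assuming a plain C
list in place of the real shared-memory queue; the names are
hypothetical and a 64-bit integer stands in for XLogRecPtr.

    #include <stdint.h>
    #include <stdio.h>

    /* One waiting backend; the real patch keeps such entries in shared
     * memory, in a queue per request mode, ordered by commit LSN. */
    typedef struct Waiter
    {
        uint64_t       commit_lsn;   /* LSN this backend needs the standby to reach */
        struct Waiter *next;
    } Waiter;

    static void wake_backend(Waiter *w)
    {
        /* Stand-in for setting the backend's latch / signalling it. */
        printf("waking backend waiting for %llu\n",
               (unsigned long long) w->commit_lsn);
    }

    /*
     * Called when a WALSender receives a standby reply: because the
     * queue is ordered by LSN, we release waiters from the head until we
     * find a backend whose commit LSN has not yet been reached on the
     * standby. Returns the new head of the queue.
     */
    static Waiter *release_waiters(Waiter *queue, uint64_t standby_lsn)
    {
        while (queue != NULL && queue->commit_lsn <= standby_lsn)
        {
            Waiter *done = queue;

            queue = queue->next;
            wake_backend(done);
        }
        return queue;
    }

    int main(void)
    {
        Waiter  w2 = { 200, NULL };
        Waiter  w1 = { 100, &w2 };
        Waiter *queue = &w1;

        /* Standby reports fsync up to LSN 150: only the first waiter is released. */
        queue = release_waiters(queue, 150);
        return (queue == &w2) ? 0 : 1;
    }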

More complex wakeup rules would be possible via a plugin.

Wait timeout would be set by individual backends with a timer, just as
we do for statement_timeout.

= CODE

Total code to implement this is low. It breaks down into five areas:
* Zoltan's libpq changes, included almost verbatim; fairly modular, so
easy to replace with something we like better
* A new module syncrep.c and syncrep.h handle the backend wait/wakeup
* Light changes to allow streaming rep to make appropriate calls
* Small amount of code to allow WALWriter to be active in recovery
* Parameter code
No docs yet.

The patch works on top of latches, though it does not rely upon them for
its bulk performance characteristics. Latches only improve response time
at very low transaction rates; they provide no additional throughput at
medium to high transaction rates.

= PERFORMANCE ANALYSIS

Since we reply to each new chunk sent from the master, "recv" mode has
absolutely minimal latency, especially since the WALReceiver no longer
performs the majority of fsyncs, as it does in the 9.0 code. The
WALReceiver does not wait for fsync or apply actions to complete before
replying, so the fsync and apply modes will always wait for at least two
standby-to-master messages, which is appropriate because those actions
will typically occur much later.

This response mechanism offers the best response times achievable in
"recv" mode and very good throughput under load. Note that the different
modes do not interfere with each other and can co-exist happily while
each provides its best performance.

Starting the WALWriter during recovery is helpful no matter what
synchronous_replication_service is specified.

Could we optimise the sending of reply messages so that only chunks that
contain a commit receive a reply? We could, but then we would need extra
bookkeeping on the master. It would need to be demonstrated that there
is a performance issue big enough to be worth the overhead on the master
and the extra code.

Is there an optimisation to be had from reducing the number of options
the standby provides? The architecture on the standby side doesn't rely
heavily on the service level specified, nor does it rely in any way on
the actual sync rep mode specified on the master, so no further
simplification is possible.


= NOT YET IMPLEMENTED

* Timeout code & NOTICE
* Code and test plugin 
* Loops in walsender, walwriter and receiver treat shutdown incorrectly

I haven't yet looked at Fujii's code for this (I'm not even sure where
it is), though I hope to do so in the future. Zoltan's libpq code is the
only part of that patch used here.

So far I have spent 3.5 days on this and expect to complete it tomorrow.
I think that throws out the argument that this proposal is too complex
to develop in this release.

= OTHER ISSUES

* How should master behave when we shut it down?
* How should standby behave when we shut it down?


I will post my patch on this thread when it is available.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services



