Thread: Synchronous replication patch built on SR
Hi,

attached is a patch that does $SUBJECT, we are submitting it for 9.1. I have
updated it to today's CVS after the "wal_level" GUC went in.

How does it work?

First, the walreceiver and the walsender are now able to communicate in a
duplex way on the same connection, so while COPY OUT is in progress from the
primary server, the standby server is able to issue PQputCopyData() to pass
the transaction IDs that were seen with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
signatures. I did this by adding a new protocol message type, with the letter
'x', that's acknowledged only by the walsender process. The regular backend
was intentionally left unchanged, so an SQL client gets a protocol error.
A new libpq call, PQsetDuplexCopy(), sends this new message before sending
START_REPLICATION. The primary makes a note of it in the walsender process'
entry.

I had to move the TransactionIdLatest(xid, nchildren, children) call that
computes latestXid earlier in RecordTransactionCommit(), so it's in the
critical section now, just before the XLogInsert(RM_XACT_ID,
XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition between
the primary and the standby server, where the standby server might have seen
the XLOG_XACT_COMMIT record for some XIDs before the transaction in the
primary server marked itself as waiting for this XID, resulting in stuck
transactions.

I have added 3 new options, two GUCs in postgresql.conf and one setting in
recovery.conf. These options are:

1. min_sync_replication_clients = N

where N is the number of reports for a given transaction before it's released
as committed synchronously. 0 means completely asynchronous; the maximum
value is capped at max_wal_senders. Anything in between 0 and max_wal_senders
means different levels of partially synchronous replication.

2. strict_sync_replication = boolean

where the expected number of synchronous reports from standby servers is
further limited to the actual number of connected synchronous standby servers
if the value of this GUC is false. This means that if no standby servers are
connected yet, then the replication is asynchronous and transactions are
allowed to finish without waiting for synchronous reports. If the value of
this GUC is true, then transactions wait until enough synchronous standbys
connect and report back.

3. synchronous_slave = boolean (in recovery.conf)

this instructs the standby server to tell the primary that it's a synchronous
replication server and that it will send the committed XIDs back to the
primary.

I also added a contrib module for monitoring the synchronous replication, but
it abuses the procarray.c code by exposing the procArray pointer, which is
ugly. It either needs to be abandoned or moved to core if and when this code
is discussed enough. :-)

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
Attachment
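[A minimal standby-side sketch of the flow described above, assuming the
patch's proposed PQsetDuplexCopy() call, which is not part of released libpq
and is assumed here to return 1 on success; the connection string, start
location, and report format are illustrative only.]

#include <stdio.h>
#include <stdint.h>
#include "libpq-fe.h"

/* Report a committed/prepared XID back to the walsender while COPY OUT
 * is in progress; the patch relaxes PQputCopyData() to allow this. */
static int
report_xid(PGconn *conn, uint32_t xid)
{
    char buf[16];
    int  len = snprintf(buf, sizeof(buf), "%u", xid);

    if (PQputCopyData(conn, buf, len) <= 0)
        return -1;
    return PQflush(conn);
}

int
main(void)
{
    /* walreceiver-style connection; SR uses the replication conninfo option */
    PGconn   *conn = PQconnectdb("host=primary replication=true");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* Patch-specific: send the new 'x' message; only a walsender ACKs it,
     * a regular backend raises a protocol error. */
    if (PQsetDuplexCopy(conn) != 1)
        return 1;

    res = PQexec(conn, "START_REPLICATION 0/0");
    if (PQresultStatus(res) != PGRES_COPY_OUT)
        return 1;

    /* ... stream WAL with PQgetCopyData(), and whenever an
     * XLOG_XACT_COMMIT or XLOG_XACT_PREPARE record is seen: */
    report_xid(conn, 12345);

    PQfinish(conn);
    return 0;
}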
2010/4/29 Boszormenyi Zoltan <zb@cybertec.at>:
> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
> I have updated it to today's CVS after the "wal_level" GUC went in.

I'm planning to create the synchronous replication patch for 9.1, too.
My design is outlined in the wiki. Let's work together on its design.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

The log-shipping replication has several synchronization levels, as follows.
Which are you going to work on?

The transaction commit on the master
#1 doesn't wait for replication (already supported in 9.0)
#2 waits for WAL to be received by the standby
#3 waits for WAL to be received and flushed by the standby
#4 waits for WAL to be received, flushed and replayed by the standby
...etc?

I'm planning to add #2 and #3 to 9.1. #4 is useful, but is outside the scope
of my development for at least 9.1. In #4, a read-only query can easily block
recovery through lock conflicts and make the transaction commit on the master
get stuck. This problem will be difficult to address within 9.1, I think.
But the design and implementation of #2 and #3 need to be easily extensible
to #4.

> How does it work?
>
> First, the walreceiver and the walsender are now able to communicate in a
> duplex way on the same connection, so while COPY OUT is in progress from
> the primary server, the standby server is able to issue PQputCopyData()
> to pass the transaction IDs that were seen with XLOG_XACT_COMMIT or
> XLOG_XACT_PREPARE signatures. I did this by adding a new protocol message
> type, with the letter 'x', that's acknowledged only by the walsender
> process. The regular backend was intentionally left unchanged, so an SQL
> client gets a protocol error. A new libpq call, PQsetDuplexCopy(), sends
> this new message before sending START_REPLICATION. The primary makes a
> note of it in the walsender process' entry.
>
> I had to move the TransactionIdLatest(xid, nchildren, children) call that
> computes latestXid earlier in RecordTransactionCommit(), so it's in the
> critical section now, just before the XLogInsert(RM_XACT_ID,
> XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition
> between the primary and the standby server, where the standby server
> might have seen the XLOG_XACT_COMMIT record for some XIDs before the
> transaction in the primary server marked itself as waiting for this XID,
> resulting in stuck transactions.

You seem to have chosen #4 as the synchronization level. Right?

In your design, the transaction commit on the master waits for its XID to be
read from the XLOG_XACT_COMMIT record and replied to by the standby. Right?
This design seems not to be extensible to #2 and #3, since walreceiver cannot
read the XID from the XLOG_XACT_COMMIT record. How about using the LSN
instead of the XID? That is, the transaction commit waits until the standby
has reached its LSN. The LSN is easier to use for walreceiver and the startup
process, I think.

What if the "synchronous" standby starts up from a very old backup? Does the
transaction on the master need to wait until a large amount of outstanding
WAL has been applied? I think that synchronous replication should start as
*asynchronous* replication, and should switch to the sync level after the gap
between the servers has become small enough. What's your opinion?

> I have added 3 new options, two GUCs in postgresql.conf and one setting
> in recovery.conf. These options are:
>
> 1. min_sync_replication_clients = N
>
> where N is the number of reports for a given transaction before it's
> released as committed synchronously. 0 means completely asynchronous;
> the maximum value is capped at max_wal_senders. Anything in between 0 and
> max_wal_senders means different levels of partially synchronous
> replication.
>
> 2. strict_sync_replication = boolean
>
> where the expected number of synchronous reports from standby servers is
> further limited to the actual number of connected synchronous standby
> servers if the value of this GUC is false. This means that if no standby
> servers are connected yet, then the replication is asynchronous and
> transactions are allowed to finish without waiting for synchronous
> reports. If the value of this GUC is true, then transactions wait until
> enough synchronous standbys connect and report back.

Why are these options necessary?

Can these options cover more than three synchronization levels?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
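[Fujii's four levels form a natural ordering; purely as an illustration,
with names invented for this example and not taken from any patch, they
could be expressed as:]

typedef enum SyncReplicationLevel
{
    SYNC_REP_NONE,      /* #1: commit doesn't wait (9.0 behaviour)       */
    SYNC_REP_RECEIVE,   /* #2: wait until WAL is received by the standby */
    SYNC_REP_FLUSH,     /* #3: wait until WAL is received and flushed    */
    SYNC_REP_REPLAY     /* #4: wait until WAL is received, flushed and
                         *     replayed; the level Zoltan's patch targets */
} SyncReplicationLevel;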
Fujii Masao wrote:
> 2010/4/29 Boszormenyi Zoltan <zb@cybertec.at>:
>> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
>> I have updated it to today's CVS after the "wal_level" GUC went in.
>
> I'm planning to create the synchronous replication patch for 9.1, too.
> My design is outlined in the wiki. Let's work together on its design.
> http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability
>
> The log-shipping replication has several synchronization levels, as
> follows. Which are you going to work on?
>
> The transaction commit on the master
> #1 doesn't wait for replication (already supported in 9.0)
> #2 waits for WAL to be received by the standby
> #3 waits for WAL to be received and flushed by the standby
> #4 waits for WAL to be received, flushed and replayed by the standby
> ...etc?
>
> I'm planning to add #2 and #3 to 9.1. #4 is useful, but is outside the
> scope of my development for at least 9.1. In #4, a read-only query can
> easily block recovery through lock conflicts and make the transaction
> commit on the master get stuck. This problem will be difficult to address
> within 9.1, I think. But the design and implementation of #2 and #3 need
> to be easily extensible to #4.
>
>> How does it work?
>>
>> First, the walreceiver and the walsender are now able to communicate in
>> a duplex way on the same connection, so while COPY OUT is in progress
>> from the primary server, the standby server is able to issue
>> PQputCopyData() to pass the transaction IDs that were seen with
>> XLOG_XACT_COMMIT or XLOG_XACT_PREPARE signatures. I did this by adding
>> a new protocol message type, with the letter 'x', that's acknowledged
>> only by the walsender process. The regular backend was intentionally
>> left unchanged, so an SQL client gets a protocol error. A new libpq
>> call, PQsetDuplexCopy(), sends this new message before sending
>> START_REPLICATION. The primary makes a note of it in the walsender
>> process' entry.
>>
>> I had to move the TransactionIdLatest(xid, nchildren, children) call
>> that computes latestXid earlier in RecordTransactionCommit(), so it's
>> in the critical section now, just before the XLogInsert(RM_XACT_ID,
>> XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition
>> between the primary and the standby server, where the standby server
>> might have seen the XLOG_XACT_COMMIT record for some XIDs before the
>> transaction in the primary server marked itself as waiting for this
>> XID, resulting in stuck transactions.
>
> You seem to have chosen #4 as the synchronization level. Right?

Yes.

> In your design, the transaction commit on the master waits for its XID to
> be read from the XLOG_XACT_COMMIT record and replied to by the standby.
> Right? This design seems not to be extensible to #2 and #3, since
> walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.

Yes, this was my problem, too. I would have had to implement a custom
interpreter in walreceiver to process the WAL records and extract the XIDs.
But at least the supporting details, i.e. not opening another connection and
instead being able to do duplex COPY operations in a server-acknowledged way,
are acceptable, no? :-)

> How about using the LSN instead of the XID? That is, the transaction
> commit waits until the standby has reached its LSN. The LSN is easier to
> use for walreceiver and the startup process, I think.

Indeed, using the LSN seems to be more appropriate for the walreceiver, but
how would you extract the information that a certain LSN means a COMMITted
transaction? Or could we release a locked transaction when the master
receives an LSN greater than or equal to the transaction's own LSN?

Sending back all the LSNs in the case of long transactions would increase the
network traffic compared to sending back only the XIDs, but the amount is not
clear to me. What I am more worried about is the contention on the
ProcArrayLock. XIDs are rarer than LSNs, no?

> What if the "synchronous" standby starts up from a very old backup? Does
> the transaction on the master need to wait until a large amount of
> outstanding WAL has been applied? I think that synchronous replication
> should start as *asynchronous* replication, and should switch to the sync
> level after the gap between the servers has become small enough. What's
> your opinion?

It's certainly one option, which I think is partly addressed by the
"strict_sync_replication" knob below. If strict_sync_replication = off, then
the master doesn't make its transactions wait for the synchronous reports,
and the client(s) can work through their WALs. IIRC, the walreceiver connects
to the master only very late in the recovery process, no?

It would be nicer if it could be made automatic. I simply thought that there
may be situations where the "strict" behaviour is desired. I was thinking
about the transactions executed on the master between the standby startup and
the walreceiver connection. Someone may want to ensure the synchronous
behaviour for every xact, no matter the amount of time it needs. Someone else
will prefer synchronous behaviour whenever possible, but will also want to
ensure a quick enough response time even if the standbys aren't started up
yet. This dilemma cries out for such a GUC; it cannot be decided
automatically.

>> I have added 3 new options, two GUCs in postgresql.conf and one setting
>> in recovery.conf. These options are:
>>
>> 1. min_sync_replication_clients = N
>>
>> where N is the number of reports for a given transaction before it's
>> released as committed synchronously. 0 means completely asynchronous;
>> the maximum value is capped at max_wal_senders. Anything in between 0
>> and max_wal_senders means different levels of partially synchronous
>> replication.
>>
>> 2. strict_sync_replication = boolean
>>
>> where the expected number of synchronous reports from standby servers
>> is further limited to the actual number of connected synchronous
>> standby servers if the value of this GUC is false. This means that if
>> no standby servers are connected yet, then the replication is
>> asynchronous and transactions are allowed to finish without waiting for
>> synchronous reports. If the value of this GUC is true, then
>> transactions wait until enough synchronous standbys connect and report
>> back.
>
> Why are these options necessary?
>
> Can these options cover more than three synchronization levels?

I think I explained it in my mail.

If min_sync_replication_clients == 0, then the replication is async.
If min_sync_replication_clients == max_wal_senders, then the replication is
fully synchronous.
If 0 < min_sync_replication_clients < max_wal_senders, then the replication
is partially synchronous, i.e. the master can wait for only, say, 50% of the
clients to report back before a transaction is considered synchronously
replicated and the relevant transactions get released from the wait.

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
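[To make the LSN idea being discussed concrete: with the 9.0-era two-part
XLogRecPtr and its XLByteLE comparison macro, the release test could look
like the sketch below. The function name and its usage are invented for
illustration; this is not code from the patch.]

#include <stdbool.h>
#include <stdint.h>

typedef struct XLogRecPtr
{
    uint32_t xlogid;    /* log file #, 0 based */
    uint32_t xrecoff;   /* byte offset within the log file */
} XLogRecPtr;

#define XLByteLE(a, b) \
    ((a).xlogid < (b).xlogid || \
     ((a).xlogid == (b).xlogid && (a).xrecoff <= (b).xrecoff))

/*
 * A backend waiting at commit can be released as soon as a standby has
 * acknowledged an LSN at or past the commit record's LSN; the standby
 * never needs to decode XLOG_XACT_COMMIT records to learn about XIDs.
 */
static bool
commit_is_replicated(XLogRecPtr commit_lsn, XLogRecPtr standby_ack_lsn)
{
    return XLByteLE(commit_lsn, standby_ack_lsn);
}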
On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
> If min_sync_replication_clients == 0, then the replication is async.
> If min_sync_replication_clients == max_wal_senders, then the replication
> is fully synchronous.
> If 0 < min_sync_replication_clients < max_wal_senders, then the
> replication is partially synchronous, i.e. the master can wait for only,
> say, 50% of the clients to report back before a transaction is considered
> synchronously replicated and the relevant transactions get released from
> the wait.

That's an interesting design and in some ways pretty elegant, but it rules
out some things that people might easily want to do - for example,
synchronous replication to the other server in the same data center that acts
as a backup for the master; and asynchronous replication to a reporting
server located off-site.

One of the things that I think we will probably need/want to change
eventually is the fact that the master has no real knowledge of who the
replication slaves are. That might be something we want to change in order to
be able to support more configurability. Inventing syntax out of whole cloth
and leaving semantics to the imagination of the reader:

CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
    break_synchrep_timeout 30);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
> On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> That's an interesting design and in some ways pretty elegant, but it
> rules out some things that people might easily want to do - for example,
> synchronous replication to the other server in the same data center that
> acts as a backup for the master; and asynchronous replication to a
> reporting server located off-site.

No, it doesn't. :-) You didn't take into account the third knob, usable in
recovery.conf:

synchronous_slave = on/off

The off-site reporting server can be an asynchronous standby, while the
on-site backup server can be synchronous. The only thing you need to take
into account is that min_sync_replication_clients shouldn't ever exceed your
actual number of synchronous standbys. The setup these three knobs provide is
pretty flexible, I think.

> One of the things that I think we will probably need/want to change
> eventually is the fact that the master has no real knowledge of who the
> replication slaves are.

The changes I made in my patch partly change that; the server still doesn't
know "who" the standbys are, but there's a call that returns the number of
connected _synchronous_ standbys.

> That might be something we want to change in order to be able to support
> more configurability. Inventing syntax out of whole cloth and leaving
> semantics to the imagination of the reader:
>
> CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
> CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
>     break_synchrep_timeout 30);

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
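[A sketch of the mixed setup Zoltan describes, using the patch's proposed
knobs; host names and other connection details are placeholders.]

# primary, postgresql.conf
max_wal_senders              = 2
min_sync_replication_clients = 1    # wait for one synchronous report
strict_sync_replication      = off  # don't wait while no sync standby is up

# on-site backup standby, recovery.conf
standby_mode      = 'on'
primary_conninfo  = 'host=primary'
synchronous_slave = 'on'            # reports XIDs back; counts as synchronous

# off-site reporting standby, recovery.conf
standby_mode      = 'on'
primary_conninfo  = 'host=primary'
synchronous_slave = 'off'           # plain asynchronous SR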
BTW, what I'd like to see as a very first patch is to change the current poll
loops in walreceiver and walsender to, well, not poll. That's a requirement
for synchronous replication, is very useful on its own, and requires some
design and implementation effort to get right. It would be nice to get that
out of the way before/while we discuss the more user-visible behavior.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-05-14 at 15:15 -0400, Robert Haas wrote:
> On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> That's an interesting design and in some ways pretty elegant, but it
> rules out some things that people might easily want to do - for example,
> synchronous replication to the other server in the same data center that
> acts as a backup for the master; and asynchronous replication to a
> reporting server located off-site.

The design above allows the case you mention:

min_sync_replication_clients = 1
max_wal_senders = 2

It works well in failure cases, such as the case where the local backup
server goes down. It seems like exactly what we need to me, though I'm not
sure about the names.

> One of the things that I think we will probably need/want to change
> eventually is the fact that the master has no real knowledge of who the
> replication slaves are. That might be something we want to change in
> order to be able to support more configurability. Inventing syntax out of
> whole cloth and leaving semantics to the imagination of the reader:
>
> CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
> CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
>     break_synchrep_timeout 30);

I am against labelling servers as synchronous/asynchronous. We've had this
discussion a few times since 2008. There is significant advantage in having
the user specify the level of robustness, so that it can vary from
transaction to transaction, just as already happens at commit. That way the
user gets to say what happens. Look for threads on "transaction controlled
robustness".

As alluded to above, if you label the servers, you also need to say what
happens when one or more of them are down, e.g. "synchronous to B AND async
to C, except when B is not available, in which case make C synchronous".
With N servers, you end up needing to specify O(N^2) rules for what happens,
so it only works neatly for 2, maybe 3 servers.

--
Simon Riggs
www.2ndQuadrant.com
Thanks for your reply!

On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> In your design, the transaction commit on the master waits for its XID
>> to be read from the XLOG_XACT_COMMIT record and replied to by the
>> standby. Right? This design seems not to be extensible to #2 and #3,
>> since walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.
>
> Yes, this was my problem, too. I would have had to implement a custom
> interpreter in walreceiver to process the WAL records and extract the
> XIDs.

Isn't reading the same WAL twice (by walreceiver and the startup process)
inefficient? In synchronous replication, the overhead of walreceiver directly
affects the performance of the master. We should not assign such hard work to
walreceiver, I think.

> But at least the supporting details, i.e. not opening another connection
> and instead being able to do duplex COPY operations in a
> server-acknowledged way, are acceptable, no? :-)

Though I might not understand your point (sorry), it's OK for the standby to
send the reply to the master by using a CopyData message. Currently
PQputCopyData() cannot be executed in COPY OUT, but we can relax that.

>> How about using the LSN instead of the XID? That is, the transaction
>> commit waits until the standby has reached its LSN. The LSN is easier to
>> use for walreceiver and the startup process, I think.
>
> Indeed, using the LSN seems to be more appropriate for the walreceiver,
> but how would you extract the information that a certain LSN means a
> COMMITted transaction? Or could we release a locked transaction when the
> master receives an LSN greater than or equal to the transaction's own
> LSN?

Yep, we can ensure that the transaction has been replicated by comparing its
own LSN with the smallest LSN among the latest LSNs of each connected
"synchronous" standby.

> Sending back all the LSNs in the case of long transactions would increase
> the network traffic compared to sending back only the XIDs, but the
> amount is not clear to me. What I am more worried about is the contention
> on the ProcArrayLock. XIDs are rarer than LSNs, no?

No. For example, when the WAL data sent by walsender at one time has two
XLOG_XACT_COMMIT records, in the XID approach walreceiver would need to send
two replies. OTOH, in the LSN approach, only one reply, indicating the last
received location, would need to be sent.

>> What if the "synchronous" standby starts up from a very old backup?
>> Does the transaction on the master need to wait until a large amount of
>> outstanding WAL has been applied? I think that synchronous replication
>> should start as *asynchronous* replication, and should switch to the
>> sync level after the gap between the servers has become small enough.
>> What's your opinion?
>
> It's certainly one option, which I think is partly addressed by the
> "strict_sync_replication" knob below. If strict_sync_replication = off,
> then the master doesn't make its transactions wait for the synchronous
> reports, and the client(s) can work through their WALs. IIRC, the
> walreceiver connects to the master only very late in the recovery
> process, no?

No, the master might have a large number of WAL files which the standby
doesn't have.

>>> I have added 3 new options, two GUCs in postgresql.conf and one setting
>>> in recovery.conf. These options are:
>>>
>>> 1. min_sync_replication_clients = N
>>>
>>> where N is the number of reports for a given transaction before it's
>>> released as committed synchronously. 0 means completely asynchronous;
>>> the maximum value is capped at max_wal_senders. Anything in between 0
>>> and max_wal_senders means different levels of partially synchronous
>>> replication.
>>>
>>> 2. strict_sync_replication = boolean
>>>
>>> where the expected number of synchronous reports from standby servers
>>> is further limited to the actual number of connected synchronous
>>> standby servers if the value of this GUC is false. This means that if
>>> no standby servers are connected yet, then the replication is
>>> asynchronous and transactions are allowed to finish without waiting
>>> for synchronous reports. If the value of this GUC is true, then
>>> transactions wait until enough synchronous standbys connect and report
>>> back.
>>
>> Why are these options necessary?
>>
>> Can these options cover more than three synchronization levels?
>
> I think I explained it in my mail.
>
> If min_sync_replication_clients == 0, then the replication is async.
> If min_sync_replication_clients == max_wal_senders, then the replication
> is fully synchronous.
> If 0 < min_sync_replication_clients < max_wal_senders, then the
> replication is partially synchronous, i.e. the master can wait for only,
> say, 50% of the clients to report back before a transaction is considered
> synchronously replicated and the relevant transactions get released from
> the wait.

Seems s/min_sync_replication_clients/max_sync_replication_clients

Is min_sync_replication_clients required to prevent an outside attacker from
connecting to the master as a "synchronous" standby and degrading the
performance on the master? Any other use case?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> BTW, what I'd like to see as a very first patch is to change the current
> poll loops in walreceiver and walsender to, well, not poll. That's a
> requirement for synchronous replication, is very useful on its own, and
> requires some design and implementation effort to get right. It would be
> nice to get that out of the way before/while we discuss the more
> user-visible behavior.

Yeah, we should wake up the walsender from sleep to send WAL data as soon as
it's flushed. But why do we need to change the loop of walreceiver? Or do you
mean changing the poll loop in the startup process?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 18/05/10 07:41, Fujii Masao wrote:
> On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> BTW, what I'd like to see as a very first patch is to change the current
>> poll loops in walreceiver and walsender to, well, not poll. That's a
>> requirement for synchronous replication, is very useful on its own, and
>> requires some design and implementation effort to get right. It would be
>> nice to get that out of the way before/while we discuss the more
>> user-visible behavior.
>
> Yeah, we should wake up the walsender from sleep to send WAL data as soon
> as it's flushed. But why do we need to change the loop of walreceiver? Or
> do you mean changing the poll loop in the startup process?

Yeah, changing the poll loop in the startup process is what I meant.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
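[One way to picture "not polling": an event-driven wait on a self-pipe,
sketched below under the assumption that the pipe is created before the
processes fork. PostgreSQL's eventual solution (latches) came later; the
helper names and the mechanism here are illustrative only, not from any
patch in this thread.]

#include <unistd.h>
#include <sys/select.h>

static int wakeup_pipe[2];      /* created with pipe() before forking */

/* Called after WAL is flushed (or applied): wake the sleeping process. */
static void
wakeup_peer(void)
{
    (void) write(wakeup_pipe[1], "x", 1);
}

/* Replaces "sleep 100ms and re-check": block until there is work to do. */
static void
wait_for_work(void)
{
    fd_set rfds;
    char   c;

    FD_ZERO(&rfds);
    FD_SET(wakeup_pipe[0], &rfds);
    (void) select(wakeup_pipe[0] + 1, &rfds, NULL, NULL, NULL);
    (void) read(wakeup_pipe[0], &c, 1);
}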
Fujii Masao wrote:
> Thanks for your reply!
>
> On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>> In your design, the transaction commit on the master waits for its XID
>>> to be read from the XLOG_XACT_COMMIT record and replied to by the
>>> standby. Right? This design seems not to be extensible to #2 and #3,
>>> since walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.
>>
>> Yes, this was my problem, too. I would have had to implement a custom
>> interpreter in walreceiver to process the WAL records and extract the
>> XIDs.
>
> Isn't reading the same WAL twice (by walreceiver and the startup process)
> inefficient?

Yes, and I didn't implement that because it's inefficient. I implemented a
minimal communication between StartupXLOG() and the walreceiver.

> In synchronous replication, the overhead of walreceiver directly affects
> the performance of the master. We should not assign such hard work to
> walreceiver, I think.

Exactly.

>> But at least the supporting details, i.e. not opening another connection
>> and instead being able to do duplex COPY operations in a
>> server-acknowledged way, are acceptable, no? :-)
>
> Though I might not understand your point (sorry), it's OK for the standby
> to send the reply to the master by using a CopyData message.

I thought about the same.

> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
> relax that.

And I implemented just that, in a way that upon walreceiver startup it sends
a new protocol message to the walsender by calling PQsetDuplexCopy() (see my
patch), and the walsender response is an ACK. This protocol message is
intentionally not handled by the normal backend, so plain libpq clients
cannot mess up their COPY streams.

>>> How about using the LSN instead of the XID? That is, the transaction
>>> commit waits until the standby has reached its LSN. The LSN is easier
>>> to use for walreceiver and the startup process, I think.
>>
>> Indeed, using the LSN seems to be more appropriate for the walreceiver,
>> but how would you extract the information that a certain LSN means a
>> COMMITted transaction? Or could we release a locked transaction when the
>> master receives an LSN greater than or equal to the transaction's own
>> LSN?
>
> Yep, we can ensure that the transaction has been replicated by comparing
> its own LSN with the smallest LSN among the latest LSNs of each connected
> "synchronous" standby.
>
>> Sending back all the LSNs in the case of long transactions would
>> increase the network traffic compared to sending back only the XIDs, but
>> the amount is not clear to me. What I am more worried about is the
>> contention on the ProcArrayLock. XIDs are rarer than LSNs, no?
>
> No. For example, when the WAL data sent by walsender at one time has two
> XLOG_XACT_COMMIT records, in the XID approach walreceiver would need to
> send two replies. OTOH, in the LSN approach, only one reply, indicating
> the last received location, would need to be sent.

I see.

>>> What if the "synchronous" standby starts up from a very old backup?
>>> Does the transaction on the master need to wait until a large amount of
>>> outstanding WAL has been applied? I think that synchronous replication
>>> should start as *asynchronous* replication, and should switch to the
>>> sync level after the gap between the servers has become small enough.
>>> What's your opinion?
>>
>> It's certainly one option, which I think is partly addressed by the
>> "strict_sync_replication" knob below. If strict_sync_replication = off,
>> then the master doesn't make its transactions wait for the synchronous
>> reports, and the client(s) can work through their WALs. IIRC, the
>> walreceiver connects to the master only very late in the recovery
>> process, no?
>
> No, the master might have a large number of WAL files which the standby
> doesn't have.

We can change the walreceiver so that it sends messages encapsulated
similarly to the walsender's. In our patch, the walreceiver currently sends
the raw XIDs. If we add a minimal protocol encapsulation, we can distinguish
between the XIDs (or later LSNs) and the "mark me synchronous from now on"
message.

The only problem is: what should be the point at which such a client becomes
synchronous from the master's POV, so that the XID/LSN reports count and
transactions are made to wait for this client?

As a side note, the async walreceivers' behaviour should be kept, so they
don't send anything back, and the message that PQsetDuplexCopy() sends to the
master would then only prepare the walsender for its client becoming
synchronous in the near future.

>>>> I have added 3 new options, two GUCs in postgresql.conf and one
>>>> setting in recovery.conf. These options are:
>>>>
>>>> 1. min_sync_replication_clients = N
>>>>
>>>> where N is the number of reports for a given transaction before it's
>>>> released as committed synchronously. 0 means completely asynchronous;
>>>> the maximum value is capped at max_wal_senders. Anything in between 0
>>>> and max_wal_senders means different levels of partially synchronous
>>>> replication.
>>>>
>>>> 2. strict_sync_replication = boolean
>>>>
>>>> where the expected number of synchronous reports from standby servers
>>>> is further limited to the actual number of connected synchronous
>>>> standby servers if the value of this GUC is false. This means that if
>>>> no standby servers are connected yet, then the replication is
>>>> asynchronous and transactions are allowed to finish without waiting
>>>> for synchronous reports. If the value of this GUC is true, then
>>>> transactions wait until enough synchronous standbys connect and
>>>> report back.
>>>
>>> Why are these options necessary?
>>>
>>> Can these options cover more than three synchronization levels?
>>
>> I think I explained it in my mail.
>>
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> Seems s/min_sync_replication_clients/max_sync_replication_clients

No, "min" indicates the minimum number of walreceiver reports needed before a
transaction can be released from the wait. The other reports coming from
walreceivers are ignored.

> Is min_sync_replication_clients required to prevent an outside attacker
> from connecting to the master as a "synchronous" standby and degrading
> the performance on the master?

??? A properly configured pg_hba.conf prevents outside attackers from
connecting as replication clients, no?

> Any other use case?
>
> Regards,

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
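[The "minimal protocol encapsulation" Zoltan mentions could be as simple as a
one-byte type tag inside each CopyData message, so the walsender can tell an
XID/LSN report from a state-change message. The tag letters, names, and
layout below are invented for illustration and are not from the patch.]

#include <string.h>
#include "libpq-fe.h"

/* Hypothetical message tags sent from walreceiver to walsender. */
#define STANDBY_MSG_XID_REPORT  'x'     /* payload: committed XID            */
#define STANDBY_MSG_LSN_REPORT  'r'     /* payload: last received LSN        */
#define STANDBY_MSG_GO_SYNC     's'     /* "mark me synchronous from now on" */

static int
send_standby_msg(PGconn *conn, char tag, const void *payload, int len)
{
    char buf[1 + 64];

    if (len > (int) sizeof(buf) - 1)
        return -1;
    buf[0] = tag;
    memcpy(buf + 1, payload, len);
    return PQputCopyData(conn, buf, 1 + len);   /* one CopyData per message */
}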
On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> Isn't reading the same WAL twice (by walreceiver and the startup
>> process) inefficient?
>
> Yes, and I didn't implement that because it's inefficient.

So I'd like to propose using the LSN instead of the XID, since the LSN can be
easily handled by both walreceiver and the startup process.

>> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
>> relax that.
>
> And I implemented just that, in a way that upon walreceiver startup it
> sends a new protocol message to the walsender by calling
> PQsetDuplexCopy() (see my patch), and the walsender response is an ACK.
> This protocol message is intentionally not handled by the normal backend,
> so plain libpq clients cannot mess up their COPY streams.

Is the newly-introduced message type "Set Duplex Copy" really required? I
think that the standby can send its replication mode to the master via the
Query or CopyData messages, which are already used in SR. For example, how
about including the mode in the handshake message "START_REPLICATION"? If we
did that, we would not need to introduce the new libpq function
PQsetDuplexCopy(). BTW, I often got complaints about adding new libpq
functions when I implemented SR ;)

In the patch, PQputCopyData() checks the newly-introduced pg_conn field
"duplexCopy". Instead, how about checking the existing field "replication"?
Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.

> We can change the walreceiver so that it sends messages encapsulated
> similarly to the walsender's. In our patch, the walreceiver currently
> sends the raw XIDs. If we add a minimal protocol encapsulation, we can
> distinguish between the XIDs (or later LSNs) and the "mark me synchronous
> from now on" message.
>
> The only problem is: what should be the point at which such a client
> becomes synchronous from the master's POV, so that the XID/LSN reports
> count and transactions are made to wait for this client?

One idea is to switch to "sync" when the gap in LSNs becomes less than or
equal to XLOG_SEG_SIZE (16MB by default). That is, walsender calculates the
gap from the current write WAL location on the master and the last
receive/flush/replay location on the standby. And if the gap <=
XLOG_SEG_SIZE, it instructs backends to wait for replication from then on.

> As a side note, the async walreceivers' behaviour should be kept, so they
> don't send anything back, and the message that PQsetDuplexCopy() sends to
> the master would then only prepare the walsender for its client becoming
> synchronous in the near future.

I agree that walreceiver should send no replication ack if the "async" mode
is chosen. OTOH, in the "sync" case, walreceiver should always send an ack,
even if the gap is large and the master doesn't wait for replication yet. As
mentioned above, walsender needs to calculate the gap from the ack.

>> Seems s/min_sync_replication_clients/max_sync_replication_clients
>
> No, "min" indicates the minimum number of walreceiver reports needed
> before a transaction can be released from the wait. The other reports
> coming from walreceivers are ignored.

Hmm... when min_sync_replication_clients = 2 and there are three
"synchronous" standbys, the master waits for only two standbys? Is the
standby which the master ignores fixed, or dynamically (or randomly) changed?

>> Is min_sync_replication_clients required to prevent an outside attacker
>> from connecting to the master as a "synchronous" standby and degrading
>> the performance on the master?
>
> ???
>
> A properly configured pg_hba.conf prevents outside attackers from
> connecting as replication clients, no?

Yes :)

I'd just like to know the use case of min_sync_replication_clients. Sorry,
I've not understood yet how useful this option is.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
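[A sketch of the gap test Fujii proposes, flattening the LSN to a single byte
offset for simplicity (the 9.0-era XLogRecPtr is actually two 32-bit fields);
the helper and variable names are invented for illustration.]

#include <stdbool.h>
#include <stdint.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)    /* default WAL segment size */

/* walsender: switch from async to sync once the standby is no more than
 * one WAL segment behind the master's current write location. */
static bool
standby_caught_up(uint64_t master_write_pos, uint64_t standby_ack_pos)
{
    return master_write_pos - standby_ack_pos <= XLOG_SEG_SIZE;
}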
Fujii Masao wrote:
> On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>> Isn't reading the same WAL twice (by walreceiver and the startup
>>> process) inefficient?
>>
>> Yes, and I didn't implement that because it's inefficient.
>
> So I'd like to propose using the LSN instead of the XID, since the LSN
> can be easily handled by both walreceiver and the startup process.

OK, I will look into replacing the XIDs with LSNs.

>>> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
>>> relax that.
>>
>> And I implemented just that, in a way that upon walreceiver startup it
>> sends a new protocol message to the walsender by calling
>> PQsetDuplexCopy() (see my patch), and the walsender response is an ACK.
>> This protocol message is intentionally not handled by the normal
>> backend, so plain libpq clients cannot mess up their COPY streams.
>
> Is the newly-introduced message type "Set Duplex Copy" really required?
> I think that the standby can send its replication mode to the master via
> the Query or CopyData messages, which are already used in SR. For
> example, how about including the mode in the handshake message
> "START_REPLICATION"? If we did that, we would not need to introduce the
> new libpq function PQsetDuplexCopy(). BTW, I often got complaints about
> adding new libpq functions when I implemented SR ;)

:-)

> In the patch, PQputCopyData() checks the newly-introduced pg_conn field
> "duplexCopy". Instead, how about checking the existing field
> "replication"?

I didn't see there was such a new field. (looking...) I can see now, it was
added in the middle of the structure. OK, we can then use it to allow duplex
COPY instead of my new field. I suppose it's non-NULL if replication is on,
right? Then the extra call is not needed.

> Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.

I think this may not be too useful for SQL clients, but who knows? :-)
Use cases, anyone?

>> We can change the walreceiver so that it sends messages encapsulated
>> similarly to the walsender's. In our patch, the walreceiver currently
>> sends the raw XIDs. If we add a minimal protocol encapsulation, we can
>> distinguish between the XIDs (or later LSNs) and the "mark me
>> synchronous from now on" message.
>>
>> The only problem is: what should be the point at which such a client
>> becomes synchronous from the master's POV, so that the XID/LSN reports
>> count and transactions are made to wait for this client?
>
> One idea is to switch to "sync" when the gap in LSNs becomes less than or
> equal to XLOG_SEG_SIZE (16MB by default). That is, walsender calculates
> the gap from the current write WAL location on the master and the last
> receive/flush/replay location on the standby. And if the gap <=
> XLOG_SEG_SIZE, it instructs backends to wait for replication from then
> on.

This is a sensible idea.

>> As a side note, the async walreceivers' behaviour should be kept, so
>> they don't send anything back, and the message that PQsetDuplexCopy()
>> sends to the master would then only prepare the walsender for its client
>> becoming synchronous in the near future.
>
> I agree that walreceiver should send no replication ack if the "async"
> mode is chosen. OTOH, in the "sync" case, walreceiver should always send
> an ack, even if the gap is large and the master doesn't wait for
> replication yet. As mentioned above, walsender needs to calculate the gap
> from the ack.

Agreed.

>>> Seems s/min_sync_replication_clients/max_sync_replication_clients
>>
>> No, "min" indicates the minimum number of walreceiver reports needed
>> before a transaction can be released from the wait. The other reports
>> coming from walreceivers are ignored.
>
> Hmm... when min_sync_replication_clients = 2 and there are three
> "synchronous" standbys, the master waits for only two standbys?

Yes. This is the idea: "partially synchronous replication". I heard anecdotes
about replication solutions where, if at least 50% of the machines across the
whole cluster report back synchronously, the transaction is considered
replicated "good enough".

> Is the standby which the master ignores fixed, or dynamically (or
> randomly) changed?

It may change randomly, depending on who sends the reports first. The
replication servers themselves may get very busy with large queries, or they
may be loaded in some other way and be somewhat late in processing the WAL
stream. The less loaded servers answer first, and the transaction is
considered properly replicated.

>>> Is min_sync_replication_clients required to prevent an outside attacker
>>> from connecting to the master as a "synchronous" standby and degrading
>>> the performance on the master?
>>
>> ???
>>
>> A properly configured pg_hba.conf prevents outside attackers from
>> connecting as replication clients, no?
>
> Yes :)
>
> I'd just like to know the use case of min_sync_replication_clients.
> Sorry, I've not understood yet how useful this option is.

I hope I answered it. :-)

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
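[The partial-synchronous rule being discussed reduces to a simple quorum test
at commit time; a sketch with invented names, not code from the patch.]

#include <stdbool.h>

/*
 * Whichever standbys report first are counted; a waiting commit is released
 * once min_sync_replication_clients reports have arrived. With the GUC set
 * to 0 the test is always true, i.e. fully asynchronous: never wait.
 */
static bool
commit_can_be_released(int reports_received, int min_sync_replication_clients)
{
    return reports_received >= min_sync_replication_clients;
}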
On Wed, May 19, 2010 at 9:58 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> In the patch, PQputCopyData() checks the newly-introduced pg_conn field
>> "duplexCopy". Instead, how about checking the existing field
>> "replication"?
>
> I didn't see there was such a new field. (looking...) I can see now, it
> was added in the middle of the structure. OK, we can then use it to allow
> duplex COPY instead of my new field. I suppose it's non-NULL if
> replication is on, right? Then the extra call is not needed.

Right. Usually the first byte of the pg_conn field also seems to be checked,
as follows, but I'm not sure whether that is valuable for this case:

if (conn->replication && conn->replication[0])

>> Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.
>
> I think this may not be too useful for SQL clients, but who knows? :-)
> Use cases, anyone?

It's for replication only.

>> Hmm... when min_sync_replication_clients = 2 and there are three
>> "synchronous" standbys, the master waits for only two standbys?
>
> Yes. This is the idea: "partially synchronous replication". I heard
> anecdotes about replication solutions where, if at least 50% of the
> machines across the whole cluster report back synchronously, the
> transaction is considered replicated "good enough".

Oh, I see. That's the first time I've heard of such a use case. We seem to
have many ideas about the knobs to control synchronization levels, and we
need to clarify which ones should be implemented for 9.1.

>> I'd just like to know the use case of min_sync_replication_clients.
>> Sorry, I've not understood yet how useful this option is.
>
> I hope I answered it. :-)

Yep. Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center