Thread: Synchronous replication patch built on SR
Hi,

attached is a patch that does $SUBJECT, we are submitting it for 9.1. I have
updated it to today's CVS after the "wal_level" GUC went in.

How does it work?

First, the walreceiver and the walsender are now able to communicate in a
duplex way on the same connection, so while COPY OUT is in progress from the
primary server, the standby server is able to issue PQputCopyData() to pass
the transaction IDs that were seen with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
signatures. I did this by adding a new protocol message type, with the letter
'x', that's acknowledged only by the walsender process. The regular backend
was intentionally left unchanged, so an SQL client gets a protocol error.
A new libpq call, PQsetDuplexCopy(), sends this new message before sending
START_REPLICATION. The primary makes a note of it in the walsender process'
entry.

I had to move the TransactionIdLatest(xid, nchildren, children) call that
computes latestXid earlier in RecordTransactionCommit(), so it's in the
critical section now, just before the XLogInsert(RM_XACT_ID,
XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition between
the primary and the standby server, where the standby server might have seen
the XLOG_XACT_COMMIT record for some XIDs before the transaction in the
primary server marked itself as waiting for this XID, resulting in stuck
transactions.

I have added 3 new options, two GUCs in postgresql.conf and one setting in
recovery.conf. These options are:

1. min_sync_replication_clients = N

where N is the number of reports for a given transaction before it's released
as committed synchronously. 0 means completely asynchronous; the maximum
value is capped at max_wal_senders. Anything in between 0 and max_wal_senders
means different levels of partially synchronous replication.

2. strict_sync_replication = boolean

where the expected number of synchronous reports from standby servers is
further limited to the actual number of connected synchronous standby servers
if the value of this GUC is false. This means that if no standby servers are
connected yet, then the replication is asynchronous and transactions are
allowed to finish without waiting for synchronous reports. If the value of
this GUC is true, then transactions wait until enough synchronous standbys
connect and report back.

3. synchronous_slave = boolean (in recovery.conf)

this instructs the standby server to tell the primary that it's a synchronous
replication server and that it will send the committed XIDs back to the
primary.

I also added a contrib module for monitoring the synchronous replication, but
it abuses the procarray.c code by exposing the procArray pointer, which is
ugly. It either needs to be abandoned or moved to core if and when this code
is discussed enough. :-)

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
Attachment
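[A minimal standby-side sketch of the flow described above, assuming the
patch's proposed PQsetDuplexCopy() call, which is not part of released libpq
and is assumed here to return 1 on success; the connection string, start
location, and report format are illustrative only.]

#include <stdio.h>
#include <stdint.h>
#include "libpq-fe.h"

/* Report a committed/prepared XID back to the walsender while COPY OUT
 * is in progress; the patch relaxes PQputCopyData() to allow this. */
static int
report_xid(PGconn *conn, uint32_t xid)
{
    char buf[16];
    int  len = snprintf(buf, sizeof(buf), "%u", xid);

    if (PQputCopyData(conn, buf, len) <= 0)
        return -1;
    return PQflush(conn);
}

int
main(void)
{
    /* walreceiver-style connection; SR uses the replication conninfo option */
    PGconn   *conn = PQconnectdb("host=primary replication=true");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* Patch-specific: send the new 'x' message; only a walsender ACKs it,
     * a regular backend raises a protocol error. */
    if (PQsetDuplexCopy(conn) != 1)
        return 1;

    res = PQexec(conn, "START_REPLICATION 0/0");
    if (PQresultStatus(res) != PGRES_COPY_OUT)
        return 1;

    /* ... stream WAL with PQgetCopyData(), and whenever an
     * XLOG_XACT_COMMIT or XLOG_XACT_PREPARE record is seen: */
    report_xid(conn, 12345);

    PQfinish(conn);
    return 0;
}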
2010/4/29 Boszormenyi Zoltan <zb@cybertec.at>:
> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
> I have updated it to today's CVS after the "wal_level" GUC went in.

I'm planning to create the synchronous replication patch for 9.1, too.
My design is outlined in the wiki. Let's work together on its design.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

The log-shipping replication has several synchronization levels, as follows.
Which are you going to work on?

The transaction commit on the master
#1 doesn't wait for replication (already supported in 9.0)
#2 waits for WAL to be received by the standby
#3 waits for WAL to be received and flushed by the standby
#4 waits for WAL to be received, flushed and replayed by the standby
...etc?

I'm planning to add #2 and #3 to 9.1. #4 is useful, but is outside the scope
of my development for at least 9.1. In #4, a read-only query can easily block
recovery through lock conflicts and make the transaction commit on the master
get stuck. This problem will be difficult to address within 9.1, I think.
But the design and implementation of #2 and #3 need to be easily extensible
to #4.

> How does it work?
>
> First, the walreceiver and the walsender are now able to communicate in a
> duplex way on the same connection, so while COPY OUT is in progress from
> the primary server, the standby server is able to issue PQputCopyData()
> to pass the transaction IDs that were seen with XLOG_XACT_COMMIT or
> XLOG_XACT_PREPARE signatures. I did this by adding a new protocol message
> type, with the letter 'x', that's acknowledged only by the walsender
> process. The regular backend was intentionally left unchanged, so an SQL
> client gets a protocol error. A new libpq call, PQsetDuplexCopy(), sends
> this new message before sending START_REPLICATION. The primary makes a
> note of it in the walsender process' entry.
>
> I had to move the TransactionIdLatest(xid, nchildren, children) call that
> computes latestXid earlier in RecordTransactionCommit(), so it's in the
> critical section now, just before the XLogInsert(RM_XACT_ID,
> XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition
> between the primary and the standby server, where the standby server
> might have seen the XLOG_XACT_COMMIT record for some XIDs before the
> transaction in the primary server marked itself as waiting for this XID,
> resulting in stuck transactions.

You seem to have chosen #4 as the synchronization level. Right?

In your design, the transaction commit on the master waits for its XID to be
read from the XLOG_XACT_COMMIT record and replied to by the standby. Right?
This design seems not to be extensible to #2 and #3, since walreceiver cannot
read the XID from the XLOG_XACT_COMMIT record. How about using the LSN
instead of the XID? That is, the transaction commit waits until the standby
has reached its LSN. The LSN is easier to use for walreceiver and the startup
process, I think.

What if the "synchronous" standby starts up from a very old backup? Does the
transaction on the master need to wait until a large amount of outstanding
WAL has been applied? I think that synchronous replication should start as
*asynchronous* replication, and should switch to the sync level after the gap
between the servers has become small enough. What's your opinion?

> I have added 3 new options, two GUCs in postgresql.conf and one setting
> in recovery.conf. These options are:
>
> 1. min_sync_replication_clients = N
>
> where N is the number of reports for a given transaction before it's
> released as committed synchronously. 0 means completely asynchronous;
> the maximum value is capped at max_wal_senders. Anything in between 0 and
> max_wal_senders means different levels of partially synchronous
> replication.
>
> 2. strict_sync_replication = boolean
>
> where the expected number of synchronous reports from standby servers is
> further limited to the actual number of connected synchronous standby
> servers if the value of this GUC is false. This means that if no standby
> servers are connected yet, then the replication is asynchronous and
> transactions are allowed to finish without waiting for synchronous
> reports. If the value of this GUC is true, then transactions wait until
> enough synchronous standbys connect and report back.

Why are these options necessary?

Can these options cover more than three synchronization levels?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
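[Fujii's four levels form a natural ordering; purely as an illustration,
with names invented for this example and not taken from any patch, they
could be expressed as:]

typedef enum SyncReplicationLevel
{
    SYNC_REP_NONE,      /* #1: commit doesn't wait (9.0 behaviour)       */
    SYNC_REP_RECEIVE,   /* #2: wait until WAL is received by the standby */
    SYNC_REP_FLUSH,     /* #3: wait until WAL is received and flushed    */
    SYNC_REP_REPLAY     /* #4: wait until WAL is received, flushed and
                         *     replayed; the level Zoltan's patch targets */
} SyncReplicationLevel;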
Fujii Masao wrote:
> 2010/4/29 Boszormenyi Zoltan <zb@cybertec.at>:
>> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
>> I have updated it to today's CVS after the "wal_level" GUC went in.
>
> I'm planning to create the synchronous replication patch for 9.1, too.
> My design is outlined in the wiki. Let's work together on its design.
> http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability
>
> The log-shipping replication has several synchronization levels, as
> follows. Which are you going to work on?
>
> The transaction commit on the master
> #1 doesn't wait for replication (already supported in 9.0)
> #2 waits for WAL to be received by the standby
> #3 waits for WAL to be received and flushed by the standby
> #4 waits for WAL to be received, flushed and replayed by the standby
> ...etc?
>
> I'm planning to add #2 and #3 to 9.1. #4 is useful, but is outside the
> scope of my development for at least 9.1. In #4, a read-only query can
> easily block recovery through lock conflicts and make the transaction
> commit on the master get stuck. This problem will be difficult to address
> within 9.1, I think. But the design and implementation of #2 and #3 need
> to be easily extensible to #4.
>
>> How does it work?
>>
>> First, the walreceiver and the walsender are now able to communicate in
>> a duplex way on the same connection, so while COPY OUT is in progress
>> from the primary server, the standby server is able to issue
>> PQputCopyData() to pass the transaction IDs that were seen with
>> XLOG_XACT_COMMIT or XLOG_XACT_PREPARE signatures. I did this by adding
>> a new protocol message type, with the letter 'x', that's acknowledged
>> only by the walsender process. The regular backend was intentionally
>> left unchanged, so an SQL client gets a protocol error. A new libpq
>> call, PQsetDuplexCopy(), sends this new message before sending
>> START_REPLICATION. The primary makes a note of it in the walsender
>> process' entry.
>>
>> I had to move the TransactionIdLatest(xid, nchildren, children) call
>> that computes latestXid earlier in RecordTransactionCommit(), so it's
>> in the critical section now, just before the XLogInsert(RM_XACT_ID,
>> XLOG_XACT_COMMIT, rdata) call. Otherwise, there was a race condition
>> between the primary and the standby server, where the standby server
>> might have seen the XLOG_XACT_COMMIT record for some XIDs before the
>> transaction in the primary server marked itself as waiting for this
>> XID, resulting in stuck transactions.
>
> You seem to have chosen #4 as the synchronization level. Right?

Yes.

> In your design, the transaction commit on the master waits for its XID to
> be read from the XLOG_XACT_COMMIT record and replied to by the standby.
> Right? This design seems not to be extensible to #2 and #3, since
> walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.

Yes, this was my problem, too. I would have had to implement a custom
interpreter in walreceiver to process the WAL records and extract the XIDs.
But at least the supporting details, i.e. not opening another connection and
instead being able to do duplex COPY operations in a server-acknowledged way,
are acceptable, no? :-)

> How about using the LSN instead of the XID? That is, the transaction
> commit waits until the standby has reached its LSN. The LSN is easier to
> use for walreceiver and the startup process, I think.

Indeed, using the LSN seems to be more appropriate for the walreceiver, but
how would you extract the information that a certain LSN means a COMMITted
transaction? Or could we release a locked transaction when the master
receives an LSN greater than or equal to the transaction's own LSN?

Sending back all the LSNs in the case of long transactions would increase the
network traffic compared to sending back only the XIDs, but the amount is not
clear to me. What I am more worried about is the contention on the
ProcArrayLock. XIDs are rarer than LSNs, no?

> What if the "synchronous" standby starts up from a very old backup? Does
> the transaction on the master need to wait until a large amount of
> outstanding WAL has been applied? I think that synchronous replication
> should start as *asynchronous* replication, and should switch to the sync
> level after the gap between the servers has become small enough. What's
> your opinion?

It's certainly one option, which I think is partly addressed by the
"strict_sync_replication" knob below. If strict_sync_replication = off, then
the master doesn't make its transactions wait for the synchronous reports,
and the client(s) can work through their WALs. IIRC, the walreceiver connects
to the master only very late in the recovery process, no?

It would be nicer if it could be made automatic. I simply thought that there
may be situations where the "strict" behaviour is desired. I was thinking
about the transactions executed on the master between the standby startup and
the walreceiver connection. Someone may want to ensure the synchronous
behaviour for every xact, no matter the amount of time it needs. Someone else
will prefer synchronous behaviour whenever possible, but will also want to
ensure a quick enough response time even if the standbys aren't started up
yet. This dilemma cries out for such a GUC; it cannot be decided
automatically.

>> I have added 3 new options, two GUCs in postgresql.conf and one setting
>> in recovery.conf. These options are:
>>
>> 1. min_sync_replication_clients = N
>>
>> where N is the number of reports for a given transaction before it's
>> released as committed synchronously. 0 means completely asynchronous;
>> the maximum value is capped at max_wal_senders. Anything in between 0
>> and max_wal_senders means different levels of partially synchronous
>> replication.
>>
>> 2. strict_sync_replication = boolean
>>
>> where the expected number of synchronous reports from standby servers
>> is further limited to the actual number of connected synchronous
>> standby servers if the value of this GUC is false. This means that if
>> no standby servers are connected yet, then the replication is
>> asynchronous and transactions are allowed to finish without waiting for
>> synchronous reports. If the value of this GUC is true, then
>> transactions wait until enough synchronous standbys connect and report
>> back.
>
> Why are these options necessary?
>
> Can these options cover more than three synchronization levels?

I think I explained it in my mail.

If min_sync_replication_clients == 0, then the replication is async.
If min_sync_replication_clients == max_wal_senders, then the replication is
fully synchronous.
If 0 < min_sync_replication_clients < max_wal_senders, then the replication
is partially synchronous, i.e. the master can wait for only, say, 50% of the
clients to report back before a transaction is considered synchronously
replicated and the relevant transactions get released from the wait.

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
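[To make the LSN idea being discussed concrete: with the 9.0-era two-part
XLogRecPtr and its XLByteLE comparison macro, the release test could look
like the sketch below. The function name and its usage are invented for
illustration; this is not code from the patch.]

#include <stdbool.h>
#include <stdint.h>

typedef struct XLogRecPtr
{
    uint32_t xlogid;    /* log file #, 0 based */
    uint32_t xrecoff;   /* byte offset within the log file */
} XLogRecPtr;

#define XLByteLE(a, b) \
    ((a).xlogid < (b).xlogid || \
     ((a).xlogid == (b).xlogid && (a).xrecoff <= (b).xrecoff))

/*
 * A backend waiting at commit can be released as soon as a standby has
 * acknowledged an LSN at or past the commit record's LSN; the standby
 * never needs to decode XLOG_XACT_COMMIT records to learn about XIDs.
 */
static bool
commit_is_replicated(XLogRecPtr commit_lsn, XLogRecPtr standby_ack_lsn)
{
    return XLByteLE(commit_lsn, standby_ack_lsn);
}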
On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
> If min_sync_replication_clients == 0, then the replication is async.
> If min_sync_replication_clients == max_wal_senders, then the replication
> is fully synchronous.
> If 0 < min_sync_replication_clients < max_wal_senders, then the
> replication is partially synchronous, i.e. the master can wait for only,
> say, 50% of the clients to report back before a transaction is considered
> synchronously replicated and the relevant transactions get released from
> the wait.

That's an interesting design and in some ways pretty elegant, but it rules
out some things that people might easily want to do - for example,
synchronous replication to the other server in the same data center that acts
as a backup for the master; and asynchronous replication to a reporting
server located off-site.

One of the things that I think we will probably need/want to change
eventually is the fact that the master has no real knowledge of who the
replication slaves are. That might be something we want to change in order to
be able to support more configurability. Inventing syntax out of whole cloth
and leaving semantics to the imagination of the reader:

CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
    break_synchrep_timeout 30);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas wrote:
> On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> That's an interesting design and in some ways pretty elegant, but it
> rules out some things that people might easily want to do - for example,
> synchronous replication to the other server in the same data center that
> acts as a backup for the master; and asynchronous replication to a
> reporting server located off-site.

No, it doesn't. :-) You didn't take into account the third knob, usable in
recovery.conf:

synchronous_slave = on/off

The off-site reporting server can be an asynchronous standby, while the
on-site backup server can be synchronous. The only thing you need to take
into account is that min_sync_replication_clients shouldn't ever exceed your
actual number of synchronous standbys. The setup these three knobs provide is
pretty flexible, I think.

> One of the things that I think we will probably need/want to change
> eventually is the fact that the master has no real knowledge of who the
> replication slaves are.

The changes I made in my patch partly change that; the server still doesn't
know "who" the standbys are, but there's a call that returns the number of
connected _synchronous_ standbys.

> That might be something we want to change in order to be able to support
> more configurability. Inventing syntax out of whole cloth and leaving
> semantics to the imagination of the reader:
>
> CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
> CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
>     break_synchrep_timeout 30);

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
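[A sketch of the mixed setup Zoltan describes, using the patch's proposed
knobs; host names and other connection details are placeholders.]

# primary, postgresql.conf
max_wal_senders              = 2
min_sync_replication_clients = 1    # wait for one synchronous report
strict_sync_replication      = off  # don't wait while no sync standby is up

# on-site backup standby, recovery.conf
standby_mode      = 'on'
primary_conninfo  = 'host=primary'
synchronous_slave = 'on'            # reports XIDs back; counts as synchronous

# off-site reporting standby, recovery.conf
standby_mode      = 'on'
primary_conninfo  = 'host=primary'
synchronous_slave = 'off'           # plain asynchronous SR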
BTW, what I'd like to see as a very first patch is to change the current poll
loops in walreceiver and walsender to, well, not poll. That's a requirement
for synchronous replication, is very useful on its own, and requires some
design and implementation effort to get right. It would be nice to get that
out of the way before/while we discuss the more user-visible behavior.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2010-05-14 at 15:15 -0400, Robert Haas wrote:
> On Fri, May 14, 2010 at 9:33 AM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> That's an interesting design and in some ways pretty elegant, but it
> rules out some things that people might easily want to do - for example,
> synchronous replication to the other server in the same data center that
> acts as a backup for the master; and asynchronous replication to a
> reporting server located off-site.

The design above allows the case you mention:

min_sync_replication_clients = 1
max_wal_senders = 2

It works well in failure cases, such as the case where the local backup
server goes down. It seems like exactly what we need to me, though I'm not
sure about the names.

> One of the things that I think we will probably need/want to change
> eventually is the fact that the master has no real knowledge of who the
> replication slaves are. That might be something we want to change in
> order to be able to support more configurability. Inventing syntax out of
> whole cloth and leaving semantics to the imagination of the reader:
>
> CREATE REPLICATION SLAVE reporting_server (mode asynchronous, xid_feedback on);
> CREATE REPLICATION SLAVE failover_server (mode synchronous, xid_feedback off,
>     break_synchrep_timeout 30);

I am against labelling servers as synchronous/asynchronous. We've had this
discussion a few times since 2008. There is significant advantage in having
the user specify the level of robustness, so that it can vary from
transaction to transaction, just as already happens at commit. That way the
user gets to say what happens. Look for threads on "transaction controlled
robustness".

As alluded to above, if you label the servers, you also need to say what
happens when one or more of them are down, e.g. "synchronous to B AND async
to C, except when B is not available, in which case make C synchronous".
With N servers, you end up needing to specify O(N^2) rules for what happens,
so it only works neatly for 2, maybe 3 servers.

--
Simon Riggs
www.2ndQuadrant.com
Thanks for your reply!

On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> In your design, the transaction commit on the master waits for its XID
>> to be read from the XLOG_XACT_COMMIT record and replied to by the
>> standby. Right? This design seems not to be extensible to #2 and #3,
>> since walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.
>
> Yes, this was my problem, too. I would have had to implement a custom
> interpreter in walreceiver to process the WAL records and extract the
> XIDs.

Isn't reading the same WAL twice (by walreceiver and the startup process)
inefficient? In synchronous replication, the overhead of walreceiver directly
affects the performance of the master. We should not assign such hard work to
walreceiver, I think.

> But at least the supporting details, i.e. not opening another connection
> and instead being able to do duplex COPY operations in a
> server-acknowledged way, are acceptable, no? :-)

Though I might not understand your point (sorry), it's OK for the standby to
send the reply to the master by using a CopyData message. Currently
PQputCopyData() cannot be executed in COPY OUT, but we can relax that.

>> How about using the LSN instead of the XID? That is, the transaction
>> commit waits until the standby has reached its LSN. The LSN is easier to
>> use for walreceiver and the startup process, I think.
>
> Indeed, using the LSN seems to be more appropriate for the walreceiver,
> but how would you extract the information that a certain LSN means a
> COMMITted transaction? Or could we release a locked transaction when the
> master receives an LSN greater than or equal to the transaction's own
> LSN?

Yep, we can ensure that the transaction has been replicated by comparing its
own LSN with the smallest LSN among the latest LSNs of each connected
"synchronous" standby.

> Sending back all the LSNs in the case of long transactions would increase
> the network traffic compared to sending back only the XIDs, but the
> amount is not clear to me. What I am more worried about is the contention
> on the ProcArrayLock. XIDs are rarer than LSNs, no?

No. For example, when the WAL data sent by walsender at one time has two
XLOG_XACT_COMMIT records, in the XID approach walreceiver would need to send
two replies. OTOH, in the LSN approach, only one reply, indicating the last
received location, would need to be sent.

>> What if the "synchronous" standby starts up from a very old backup?
>> Does the transaction on the master need to wait until a large amount of
>> outstanding WAL has been applied? I think that synchronous replication
>> should start as *asynchronous* replication, and should switch to the
>> sync level after the gap between the servers has become small enough.
>> What's your opinion?
>
> It's certainly one option, which I think is partly addressed by the
> "strict_sync_replication" knob below. If strict_sync_replication = off,
> then the master doesn't make its transactions wait for the synchronous
> reports, and the client(s) can work through their WALs. IIRC, the
> walreceiver connects to the master only very late in the recovery
> process, no?

No, the master might have a large number of WAL files which the standby
doesn't have.

>>> I have added 3 new options, two GUCs in postgresql.conf and one setting
>>> in recovery.conf. These options are:
>>>
>>> 1. min_sync_replication_clients = N
>>>
>>> where N is the number of reports for a given transaction before it's
>>> released as committed synchronously. 0 means completely asynchronous;
>>> the maximum value is capped at max_wal_senders. Anything in between 0
>>> and max_wal_senders means different levels of partially synchronous
>>> replication.
>>>
>>> 2. strict_sync_replication = boolean
>>>
>>> where the expected number of synchronous reports from standby servers
>>> is further limited to the actual number of connected synchronous
>>> standby servers if the value of this GUC is false. This means that if
>>> no standby servers are connected yet, then the replication is
>>> asynchronous and transactions are allowed to finish without waiting
>>> for synchronous reports. If the value of this GUC is true, then
>>> transactions wait until enough synchronous standbys connect and report
>>> back.
>>
>> Why are these options necessary?
>>
>> Can these options cover more than three synchronization levels?
>
> I think I explained it in my mail.
>
> If min_sync_replication_clients == 0, then the replication is async.
> If min_sync_replication_clients == max_wal_senders, then the replication
> is fully synchronous.
> If 0 < min_sync_replication_clients < max_wal_senders, then the
> replication is partially synchronous, i.e. the master can wait for only,
> say, 50% of the clients to report back before a transaction is considered
> synchronously replicated and the relevant transactions get released from
> the wait.

Seems s/min_sync_replication_clients/max_sync_replication_clients

Is min_sync_replication_clients required to prevent an outside attacker from
connecting to the master as a "synchronous" standby and degrading the
performance on the master? Any other use case?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> BTW, what I'd like to see as a very first patch is to change the current
> poll loops in walreceiver and walsender to, well, not poll. That's a
> requirement for synchronous replication, is very useful on its own, and
> requires some design and implementation effort to get right. It would be
> nice to get that out of the way before/while we discuss the more
> user-visible behavior.

Yeah, we should wake up the walsender from sleep to send WAL data as soon as
it's flushed. But why do we need to change the loop of walreceiver? Or do you
mean changing the poll loop in the startup process?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 18/05/10 07:41, Fujii Masao wrote:
> On Sat, May 15, 2010 at 4:59 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> BTW, what I'd like to see as a very first patch is to change the current
>> poll loops in walreceiver and walsender to, well, not poll. That's a
>> requirement for synchronous replication, is very useful on its own, and
>> requires some design and implementation effort to get right. It would be
>> nice to get that out of the way before/while we discuss the more
>> user-visible behavior.
>
> Yeah, we should wake up the walsender from sleep to send WAL data as soon
> as it's flushed. But why do we need to change the loop of walreceiver? Or
> do you mean changing the poll loop in the startup process?

Yeah, changing the poll loop in the startup process is what I meant.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
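[One way to picture "not polling": an event-driven wait on a self-pipe,
sketched below under the assumption that the pipe is created before the
processes fork. PostgreSQL's eventual solution (latches) came later; the
helper names and the mechanism here are illustrative only, not from any
patch in this thread.]

#include <unistd.h>
#include <sys/select.h>

static int wakeup_pipe[2];      /* created with pipe() before forking */

/* Called after WAL is flushed (or applied): wake the sleeping process. */
static void
wakeup_peer(void)
{
    (void) write(wakeup_pipe[1], "x", 1);
}

/* Replaces "sleep 100ms and re-check": block until there is work to do. */
static void
wait_for_work(void)
{
    fd_set rfds;
    char   c;

    FD_ZERO(&rfds);
    FD_SET(wakeup_pipe[0], &rfds);
    (void) select(wakeup_pipe[0] + 1, &rfds, NULL, NULL, NULL);
    (void) read(wakeup_pipe[0], &c, 1);
}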
Fujii Masao wrote:
> Thanks for your reply!
>
> On Fri, May 14, 2010 at 10:33 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>> In your design, the transaction commit on the master waits for its XID
>>> to be read from the XLOG_XACT_COMMIT record and replied to by the
>>> standby. Right? This design seems not to be extensible to #2 and #3,
>>> since walreceiver cannot read the XID from the XLOG_XACT_COMMIT record.
>>
>> Yes, this was my problem, too. I would have had to implement a custom
>> interpreter in walreceiver to process the WAL records and extract the
>> XIDs.
>
> Isn't reading the same WAL twice (by walreceiver and the startup process)
> inefficient?

Yes, and I didn't implement that because it's inefficient. I implemented a
minimal communication between StartupXLOG() and the walreceiver.

> In synchronous replication, the overhead of walreceiver directly affects
> the performance of the master. We should not assign such hard work to
> walreceiver, I think.

Exactly.

>> But at least the supporting details, i.e. not opening another connection
>> and instead being able to do duplex COPY operations in a
>> server-acknowledged way, are acceptable, no? :-)
>
> Though I might not understand your point (sorry), it's OK for the standby
> to send the reply to the master by using a CopyData message.

I thought about the same.

> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
> relax that.

And I implemented just that, in a way that upon walreceiver startup it sends
a new protocol message to the walsender by calling PQsetDuplexCopy() (see my
patch), and the walsender response is an ACK. This protocol message is
intentionally not handled by the normal backend, so plain libpq clients
cannot mess up their COPY streams.

>>> How about using the LSN instead of the XID? That is, the transaction
>>> commit waits until the standby has reached its LSN. The LSN is easier
>>> to use for walreceiver and the startup process, I think.
>>
>> Indeed, using the LSN seems to be more appropriate for the walreceiver,
>> but how would you extract the information that a certain LSN means a
>> COMMITted transaction? Or could we release a locked transaction when the
>> master receives an LSN greater than or equal to the transaction's own
>> LSN?
>
> Yep, we can ensure that the transaction has been replicated by comparing
> its own LSN with the smallest LSN among the latest LSNs of each connected
> "synchronous" standby.
>
>> Sending back all the LSNs in the case of long transactions would
>> increase the network traffic compared to sending back only the XIDs, but
>> the amount is not clear to me. What I am more worried about is the
>> contention on the ProcArrayLock. XIDs are rarer than LSNs, no?
>
> No. For example, when the WAL data sent by walsender at one time has two
> XLOG_XACT_COMMIT records, in the XID approach walreceiver would need to
> send two replies. OTOH, in the LSN approach, only one reply, indicating
> the last received location, would need to be sent.

I see.

>>> What if the "synchronous" standby starts up from a very old backup?
>>> Does the transaction on the master need to wait until a large amount of
>>> outstanding WAL has been applied? I think that synchronous replication
>>> should start as *asynchronous* replication, and should switch to the
>>> sync level after the gap between the servers has become small enough.
>>> What's your opinion?
>>
>> It's certainly one option, which I think is partly addressed by the
>> "strict_sync_replication" knob below. If strict_sync_replication = off,
>> then the master doesn't make its transactions wait for the synchronous
>> reports, and the client(s) can work through their WALs. IIRC, the
>> walreceiver connects to the master only very late in the recovery
>> process, no?
>
> No, the master might have a large number of WAL files which the standby
> doesn't have.

We can change the walreceiver so that it sends messages encapsulated
similarly to the walsender's. In our patch, the walreceiver currently sends
the raw XIDs. If we add a minimal protocol encapsulation, we can distinguish
between the XIDs (or later LSNs) and the "mark me synchronous from now on"
message.

The only problem is: what should be the point at which such a client becomes
synchronous from the master's POV, so that the XID/LSN reports count and
transactions are made to wait for this client?

As a side note, the async walreceivers' behaviour should be kept, so they
don't send anything back, and the message that PQsetDuplexCopy() sends to the
master would then only prepare the walsender for its client becoming
synchronous in the near future.

>>>> I have added 3 new options, two GUCs in postgresql.conf and one
>>>> setting in recovery.conf. These options are:
>>>>
>>>> 1. min_sync_replication_clients = N
>>>>
>>>> where N is the number of reports for a given transaction before it's
>>>> released as committed synchronously. 0 means completely asynchronous;
>>>> the maximum value is capped at max_wal_senders. Anything in between 0
>>>> and max_wal_senders means different levels of partially synchronous
>>>> replication.
>>>>
>>>> 2. strict_sync_replication = boolean
>>>>
>>>> where the expected number of synchronous reports from standby servers
>>>> is further limited to the actual number of connected synchronous
>>>> standby servers if the value of this GUC is false. This means that if
>>>> no standby servers are connected yet, then the replication is
>>>> asynchronous and transactions are allowed to finish without waiting
>>>> for synchronous reports. If the value of this GUC is true, then
>>>> transactions wait until enough synchronous standbys connect and
>>>> report back.
>>>
>>> Why are these options necessary?
>>>
>>> Can these options cover more than three synchronization levels?
>>
>> I think I explained it in my mail.
>>
>> If min_sync_replication_clients == 0, then the replication is async.
>> If min_sync_replication_clients == max_wal_senders, then the replication
>> is fully synchronous.
>> If 0 < min_sync_replication_clients < max_wal_senders, then the
>> replication is partially synchronous, i.e. the master can wait for only,
>> say, 50% of the clients to report back before a transaction is
>> considered synchronously replicated and the relevant transactions get
>> released from the wait.
>
> Seems s/min_sync_replication_clients/max_sync_replication_clients

No, "min" indicates the minimum number of walreceiver reports needed before a
transaction can be released from the wait. The other reports coming from
walreceivers are ignored.

> Is min_sync_replication_clients required to prevent an outside attacker
> from connecting to the master as a "synchronous" standby and degrading
> the performance on the master?

??? A properly configured pg_hba.conf prevents outside attackers from
connecting as replication clients, no?

> Any other use case?
>
> Regards,

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
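[The "minimal protocol encapsulation" Zoltan mentions could be as simple as a
one-byte type tag inside each CopyData message, so the walsender can tell an
XID/LSN report from a state-change message. The tag letters, names, and
layout below are invented for illustration and are not from the patch.]

#include <string.h>
#include "libpq-fe.h"

/* Hypothetical message tags sent from walreceiver to walsender. */
#define STANDBY_MSG_XID_REPORT  'x'     /* payload: committed XID            */
#define STANDBY_MSG_LSN_REPORT  'r'     /* payload: last received LSN        */
#define STANDBY_MSG_GO_SYNC     's'     /* "mark me synchronous from now on" */

static int
send_standby_msg(PGconn *conn, char tag, const void *payload, int len)
{
    char buf[1 + 64];

    if (len > (int) sizeof(buf) - 1)
        return -1;
    buf[0] = tag;
    memcpy(buf + 1, payload, len);
    return PQputCopyData(conn, buf, 1 + len);   /* one CopyData per message */
}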
On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> Isn't reading the same WAL twice (by walreceiver and the startup
>> process) inefficient?
>
> Yes, and I didn't implement that because it's inefficient.

So I'd like to propose using the LSN instead of the XID, since the LSN can be
easily handled by both walreceiver and the startup process.

>> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
>> relax that.
>
> And I implemented just that, in a way that upon walreceiver startup it
> sends a new protocol message to the walsender by calling
> PQsetDuplexCopy() (see my patch), and the walsender response is an ACK.
> This protocol message is intentionally not handled by the normal backend,
> so plain libpq clients cannot mess up their COPY streams.

Is the newly-introduced message type "Set Duplex Copy" really required? I
think that the standby can send its replication mode to the master via the
Query or CopyData messages, which are already used in SR. For example, how
about including the mode in the handshake message "START_REPLICATION"? If we
did that, we would not need to introduce the new libpq function
PQsetDuplexCopy(). BTW, I often got complaints about adding new libpq
functions when I implemented SR ;)

In the patch, PQputCopyData() checks the newly-introduced pg_conn field
"duplexCopy". Instead, how about checking the existing field "replication"?
Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.

> We can change the walreceiver so that it sends messages encapsulated
> similarly to the walsender's. In our patch, the walreceiver currently
> sends the raw XIDs. If we add a minimal protocol encapsulation, we can
> distinguish between the XIDs (or later LSNs) and the "mark me synchronous
> from now on" message.
>
> The only problem is: what should be the point at which such a client
> becomes synchronous from the master's POV, so that the XID/LSN reports
> count and transactions are made to wait for this client?

One idea is to switch to "sync" when the gap in LSNs becomes less than or
equal to XLOG_SEG_SIZE (16MB by default). That is, walsender calculates the
gap from the current write WAL location on the master and the last
receive/flush/replay location on the standby. And if the gap <=
XLOG_SEG_SIZE, it instructs backends to wait for replication from then on.

> As a side note, the async walreceivers' behaviour should be kept, so they
> don't send anything back, and the message that PQsetDuplexCopy() sends to
> the master would then only prepare the walsender for its client becoming
> synchronous in the near future.

I agree that walreceiver should send no replication ack if the "async" mode
is chosen. OTOH, in the "sync" case, walreceiver should always send an ack,
even if the gap is large and the master doesn't wait for replication yet. As
mentioned above, walsender needs to calculate the gap from the ack.

>> Seems s/min_sync_replication_clients/max_sync_replication_clients
>
> No, "min" indicates the minimum number of walreceiver reports needed
> before a transaction can be released from the wait. The other reports
> coming from walreceivers are ignored.

Hmm... when min_sync_replication_clients = 2 and there are three
"synchronous" standbys, the master waits for only two standbys? Is the
standby which the master ignores fixed, or dynamically (or randomly) changed?

>> Is min_sync_replication_clients required to prevent an outside attacker
>> from connecting to the master as a "synchronous" standby and degrading
>> the performance on the master?
>
> ???
>
> A properly configured pg_hba.conf prevents outside attackers from
> connecting as replication clients, no?

Yes :)

I'd just like to know the use case of min_sync_replication_clients. Sorry,
I've not understood yet how useful this option is.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
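[A sketch of the gap test Fujii proposes, flattening the LSN to a single byte
offset for simplicity (the 9.0-era XLogRecPtr is actually two 32-bit fields);
the helper and variable names are invented for illustration.]

#include <stdbool.h>
#include <stdint.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)    /* default WAL segment size */

/* walsender: switch from async to sync once the standby is no more than
 * one WAL segment behind the master's current write location. */
static bool
standby_caught_up(uint64_t master_write_pos, uint64_t standby_ack_pos)
{
    return master_write_pos - standby_ack_pos <= XLOG_SEG_SIZE;
}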
Fujii Masao wrote:
> On Wed, May 19, 2010 at 5:41 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>>> Isn't reading the same WAL twice (by walreceiver and the startup
>>> process) inefficient?
>>
>> Yes, and I didn't implement that because it's inefficient.
>
> So I'd like to propose using the LSN instead of the XID, since the LSN
> can be easily handled by both walreceiver and the startup process.

OK, I will look into replacing the XIDs with LSNs.

>>> Currently PQputCopyData() cannot be executed in COPY OUT, but we can
>>> relax that.
>>
>> And I implemented just that, in a way that upon walreceiver startup it
>> sends a new protocol message to the walsender by calling
>> PQsetDuplexCopy() (see my patch), and the walsender response is an ACK.
>> This protocol message is intentionally not handled by the normal
>> backend, so plain libpq clients cannot mess up their COPY streams.
>
> Is the newly-introduced message type "Set Duplex Copy" really required?
> I think that the standby can send its replication mode to the master via
> the Query or CopyData messages, which are already used in SR. For
> example, how about including the mode in the handshake message
> "START_REPLICATION"? If we did that, we would not need to introduce the
> new libpq function PQsetDuplexCopy(). BTW, I often got complaints about
> adding new libpq functions when I implemented SR ;)

:-)

> In the patch, PQputCopyData() checks the newly-introduced pg_conn field
> "duplexCopy". Instead, how about checking the existing field
> "replication"?

I didn't see there was such a new field. (looking...) I can see now, it was
added in the middle of the structure. OK, we can then use it to allow duplex
COPY instead of my new field. I suppose it's non-NULL if replication is on,
right? Then the extra call is not needed.

> Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.

I think this may not be too useful for SQL clients, but who knows? :-)
Use cases, anyone?

>> We can change the walreceiver so that it sends messages encapsulated
>> similarly to the walsender's. In our patch, the walreceiver currently
>> sends the raw XIDs. If we add a minimal protocol encapsulation, we can
>> distinguish between the XIDs (or later LSNs) and the "mark me
>> synchronous from now on" message.
>>
>> The only problem is: what should be the point at which such a client
>> becomes synchronous from the master's POV, so that the XID/LSN reports
>> count and transactions are made to wait for this client?
>
> One idea is to switch to "sync" when the gap in LSNs becomes less than or
> equal to XLOG_SEG_SIZE (16MB by default). That is, walsender calculates
> the gap from the current write WAL location on the master and the last
> receive/flush/replay location on the standby. And if the gap <=
> XLOG_SEG_SIZE, it instructs backends to wait for replication from then
> on.

This is a sensible idea.

>> As a side note, the async walreceivers' behaviour should be kept, so
>> they don't send anything back, and the message that PQsetDuplexCopy()
>> sends to the master would then only prepare the walsender for its client
>> becoming synchronous in the near future.
>
> I agree that walreceiver should send no replication ack if the "async"
> mode is chosen. OTOH, in the "sync" case, walreceiver should always send
> an ack, even if the gap is large and the master doesn't wait for
> replication yet. As mentioned above, walsender needs to calculate the gap
> from the ack.

Agreed.

>>> Seems s/min_sync_replication_clients/max_sync_replication_clients
>>
>> No, "min" indicates the minimum number of walreceiver reports needed
>> before a transaction can be released from the wait. The other reports
>> coming from walreceivers are ignored.
>
> Hmm... when min_sync_replication_clients = 2 and there are three
> "synchronous" standbys, the master waits for only two standbys?

Yes. This is the idea: "partially synchronous replication". I heard anecdotes
about replication solutions where, if at least 50% of the machines across the
whole cluster report back synchronously, the transaction is considered
replicated "good enough".

> Is the standby which the master ignores fixed, or dynamically (or
> randomly) changed?

It may change randomly, depending on who sends the reports first. The
replication servers themselves may get very busy with large queries, or they
may be loaded in some other way and be somewhat late in processing the WAL
stream. The less loaded servers answer first, and the transaction is
considered properly replicated.

>>> Is min_sync_replication_clients required to prevent an outside attacker
>>> from connecting to the master as a "synchronous" standby and degrading
>>> the performance on the master?
>>
>> ???
>>
>> A properly configured pg_hba.conf prevents outside attackers from
>> connecting as replication clients, no?
>
> Yes :)
>
> I'd just like to know the use case of min_sync_replication_clients.
> Sorry, I've not understood yet how useful this option is.

I hope I answered it. :-)

Best regards,
Zoltán Böszörményi

--
Bible has answers for everything. Proof:
"But let your communication be, Yea, yea; Nay, nay: for whatsoever is more
than these cometh of evil." (Matthew 5:37) - basics of digital technology.
"May your kingdom come" - superficial description of plate tectonics
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/
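[The partial-synchronous rule being discussed reduces to a simple quorum test
at commit time; a sketch with invented names, not code from the patch.]

#include <stdbool.h>

/*
 * Whichever standbys report first are counted; a waiting commit is released
 * once min_sync_replication_clients reports have arrived. With the GUC set
 * to 0 the test is always true, i.e. fully asynchronous: never wait.
 */
static bool
commit_can_be_released(int reports_received, int min_sync_replication_clients)
{
    return reports_received >= min_sync_replication_clients;
}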
On Wed, May 19, 2010 at 9:58 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> In the patch, PQputCopyData() checks the newly-introduced pg_conn field
>> "duplexCopy". Instead, how about checking the existing field
>> "replication"?
>
> I didn't see there was such a new field. (looking...) I can see now, it
> was added in the middle of the structure. OK, we can then use it to allow
> duplex COPY instead of my new field. I suppose it's non-NULL if
> replication is on, right? Then the extra call is not needed.

Right. Usually the first byte of the pg_conn field also seems to be checked,
as follows, but I'm not sure whether that is valuable for this case:

if (conn->replication && conn->replication[0])

>> Or we can just allow PQputCopyData() to go ahead even in COPY OUT state.
>
> I think this may not be too useful for SQL clients, but who knows? :-)
> Use cases, anyone?

It's for replication only.

>> Hmm... when min_sync_replication_clients = 2 and there are three
>> "synchronous" standbys, the master waits for only two standbys?
>
> Yes. This is the idea: "partially synchronous replication". I heard
> anecdotes about replication solutions where, if at least 50% of the
> machines across the whole cluster report back synchronously, the
> transaction is considered replicated "good enough".

Oh, I see. That's the first time I've heard of such a use case. We seem to
have many ideas about the knobs to control synchronization levels, and we
need to clarify which ones should be implemented for 9.1.

>> I'd just like to know the use case of min_sync_replication_clients.
>> Sorry, I've not understood yet how useful this option is.
>
> I hope I answered it. :-)

Yep. Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center