Thread: Synchronous replication, reading WAL for sending
As the patch stands, whenever an XLOG segment is switched in XLogInsert, we wait for the segment to be sent to the standby server. That's not good. Particularly in asynchronous mode, you'd expect the standby to not have any significant ill effect on the master. But in case of a flaky network connection, or a busy or dead standby, it can take a long time for the standby to respond, or for the primary to give up. During that time, all WAL insertions on the primary are blocked. (How long is the default TCP timeout again?)

Another point is that in the future, we really shouldn't require setting up archiving and file-based log shipping using external scripts when all you want is replication. It should be enough to restore a base backup on the standby, point it to the IP address of the primary, and have it catch up. This is very important, IMHO. It's quite a lot of work to set up archiving and log-file shipping, for no obvious reason. It's really only needed at the moment because we're building this feature from spare parts.

For those reasons, we need a way to send arbitrary ranges of WAL from primary to standby. The current method, where the WAL is read from wal_buffers, obviously only works for very recent WAL pages that are still in wal_buffers. The design should be changed so that instead of reading from wal_buffers, the WAL is read from the filesystem.

Sending directly from wal_buffers can be provided as a fastpath when sending a recent enough WAL range, but I wouldn't bother complicating the code for now.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
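To make the file-read proposal concrete, here is a minimal sketch of what reading an arbitrary WAL range from segment files could look like. Everything here is an illustrative assumption, not code from the patch: a flat 64-bit byte position stands in for XLogRecPtr, and the file-name scheme is simplified (real segment names also encode the timeline and log id).

    /* Illustrative sketch only: read an arbitrary WAL range from segment
     * files on disk.  A flat 64-bit byte position stands in for the real
     * XLogRecPtr, and the file-name scheme is simplified. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define WAL_SEG_SIZE (16 * 1024 * 1024)     /* default 16 MB segments */

    /* Hypothetical helper: map a segment number to its file name. */
    static void
    wal_file_name(char *buf, size_t len, uint64_t segno)
    {
        snprintf(buf, len, "pg_xlog/%016llX", (unsigned long long) segno);
    }

    /* Read 'count' bytes of WAL starting at byte position 'startptr',
     * crossing segment boundaries as needed.  Returns 0 on success. */
    int
    read_wal_range(char *dst, uint64_t startptr, size_t count)
    {
        while (count > 0)
        {
            uint64_t segno = startptr / WAL_SEG_SIZE;
            off_t    off = startptr % WAL_SEG_SIZE;
            size_t   n = count;
            char     path[64];
            int      fd;

            if (off + n > WAL_SEG_SIZE)         /* clamp to segment end */
                n = WAL_SEG_SIZE - off;

            wal_file_name(path, sizeof(path), segno);
            if ((fd = open(path, O_RDONLY)) < 0)
                return -1;                      /* recycled or archived away */
            if (pread(fd, dst, n, off) != (ssize_t) n)
            {
                close(fd);
                return -1;
            }
            close(fd);

            dst += n;
            startptr += n;
            count -= n;
        }
        return 0;
    }

A walsender built on something like this keeps the connection independent of XLogInsert: the inserting backends never wait, and the sender just trails behind, reading whatever has already been written out.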
On Tue, 2008-12-23 at 17:42 +0200, Heikki Linnakangas wrote:

> As the patch stands, whenever an XLOG segment is switched in XLogInsert, we
> wait for the segment to be sent to the standby server. That's not good.
> Particularly in asynchronous mode, you'd expect the standby to not have
> any significant ill effect on the master. But in case of a flaky network
> connection, or a busy or dead standby, it can take a long time for the
> standby to respond, or for the primary to give up. During that time, all
> WAL insertions on the primary are blocked. (How long is the default TCP
> timeout again?)

Ugh, didn't see that. Get rid of that. We managed to get rid of the fsync of the control file when we switched WAL files at the start of 8.3. That had a major effect on performance, via reduced response time profiles. No need to re-introduce a delay in the same place.

> Another point is that in the future, we really shouldn't require setting
> up archiving and file-based log shipping using external scripts when
> all you want is replication. It should be enough to restore a base
> backup on the standby, point it to the IP address of the primary, and
> have it catch up. This is very important, IMHO. It's quite a lot of
> work to set up archiving and log-file shipping, for no obvious reason.
> It's really only needed at the moment because we're building this
> feature from spare parts.

Happy for that to be hidden more from users.

> For those reasons, we need a way to send arbitrary ranges of WAL from
> primary to standby. The current method, where the WAL is read from
> wal_buffers, obviously only works for very recent WAL pages that are
> still in wal_buffers. The design should be changed so that instead of
> reading from wal_buffers, the WAL is read from the filesystem.

There are two basic ways to read WAL: from memory and from files. Sure, we can hide the two mechanisms in code better, but they will remain fairly distinct.

> Sending directly from wal_buffers can be provided as a fastpath when
> sending a recent enough WAL range, but I wouldn't bother complicating
> the code for now.

Sounds like you are saying to completely replace write-from-buffers with write-from-file? Sending from wal_buffers is OK if wal_buffers is large enough. If streaming replication falls so far behind that we have problems, then there are larger issues to worry about, like whether the primary is being driven too hard for the network to cope.

Copying directly from memory means that a disk problem on the primary will never cause corruption on the standby. Reading WAL files can mean that corruptions get propagated. The current design also allows for file-based WAL sending, if the connection is so poor that streaming won't work.

If you are seriously suggesting these things now then I'd like to see some diagrams, designs and descriptions so we can all understand what is being suggested, and how it will cope with all the current requirements.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
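To make the two paths and the fastpath concrete, a small sketch of the dispatch logic. Both helper names are hypothetical, and read_wal_range is the file-reading sketch above:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers; neither name comes from the actual patch. */
    extern uint64_t wal_buffers_oldest_ptr(void);
    extern int read_from_wal_buffers(char *dst, uint64_t startptr, size_t count);
    extern int read_wal_range(char *dst, uint64_t startptr, size_t count);

    int
    send_wal_range(char *dst, uint64_t startptr, size_t count)
    {
        /* Fastpath: the requested range is recent enough to still be
         * resident in wal_buffers. */
        if (startptr >= wal_buffers_oldest_ptr())
            return read_from_wal_buffers(dst, startptr, count);

        /* Slow path: the range has already been written out to pg_xlog,
         * so read it back from the segment files. */
        return read_wal_range(dst, startptr, count);
    }

The subtlety, of course, is that in real code the check and the buffer read would have to happen under a lock, or be rechecked, since wal_buffers can be recycled between the two. That locking is part of why the two mechanisms stay fairly distinct.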
Hi,

On Wed, Dec 24, 2008 at 1:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2008-12-23 at 17:42 +0200, Heikki Linnakangas wrote:
>
>> As the patch stands, whenever an XLOG segment is switched in XLogInsert, we
>> wait for the segment to be sent to the standby server. That's not good.
>> Particularly in asynchronous mode, you'd expect the standby to not have
>> any significant ill effect on the master. But in case of a flaky network
>> connection, or a busy or dead standby, it can take a long time for the
>> standby to respond, or for the primary to give up. During that time, all
>> WAL insertions on the primary are blocked. (How long is the default TCP
>> timeout again?)
>
> Ugh, didn't see that. Get rid of that. We managed to get rid of the
> fsync of the control file when we switched WAL files at the start of 8.3.
> That had a major effect on performance, via reduced response time
> profiles. No need to re-introduce a delay in the same place.

Yes, I will get rid of it. Should that be only in the async case, or in both (async & sync)?

>> For those reasons, we need a way to send arbitrary ranges of WAL from
>> primary to standby. The current method, where the WAL is read from
>> wal_buffers, obviously only works for very recent WAL pages that are
>> still in wal_buffers. The design should be changed so that instead of
>> reading from wal_buffers, the WAL is read from the filesystem.

By "filesystem", do you mean only pg_xlog? If it includes the archive, we might have to execute restore_command in order to read WAL from the filesystem.

> If you are seriously suggesting these things now then I'd like to see
> some diagrams, designs and descriptions so we can all understand what is
> being suggested, and how it will cope with all the current requirements.

I'd like to see that too.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
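For what it's worth, the restore-from-archive step mentioned here wouldn't need new machinery: it is the same %f/%p command expansion that archive recovery already performs via restore_command. A hedged sketch, with the helper name and buffer sizes invented for illustration:

    /* Sketch: temporarily restore an archived WAL segment by expanding the
     * user's restore_command, the way archive recovery already does.  The
     * %f (file name) and %p (destination path) placeholders are the
     * existing recovery.conf convention; this helper itself is
     * hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    restore_archived_segment(const char *restore_command,
                             const char *walfile,   /* e.g. "000000010000000000000002" */
                             const char *dest_path) /* e.g. "pg_xlog/RECOVERYXLOG" */
    {
        char cmd[1024] = "";
        const char *s;

        for (s = restore_command; *s; s++)
        {
            char piece[512];

            if (s[0] == '%' && s[1] == 'f')
            {
                snprintf(piece, sizeof(piece), "%s", walfile);
                s++;
            }
            else if (s[0] == '%' && s[1] == 'p')
            {
                snprintf(piece, sizeof(piece), "%s", dest_path);
                s++;
            }
            else
            {
                piece[0] = *s;
                piece[1] = '\0';
            }

            if (strlen(cmd) + strlen(piece) >= sizeof(cmd))
                return -1;              /* expanded command too long */
            strcat(cmd, piece);
        }

        /* Non-zero exit status means the segment is not in the archive. */
        return system(cmd) == 0 ? 0 : -1;
    }

The open question is not how to run the command but who runs it: a walsender blocking in system() while a standby waits is exactly the kind of coupling being objected to upthread.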
On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> As the patch stands, whenever an XLOG segment is switched in XLogInsert, we
> wait for the segment to be sent to the standby server. That's not good.
> Particularly in asynchronous mode, you'd expect the standby to not have any
> significant ill effect on the master. But in case of a flaky network
> connection, or a busy or dead standby, it can take a long time for the
> standby to respond, or for the primary to give up. During that time, all WAL
> insertions on the primary are blocked. (How long is the default TCP timeout
> again?)
>
> Another point is that in the future, we really shouldn't require setting up
> archiving and file-based log shipping using external scripts when all you
> want is replication. It should be enough to restore a base backup on the
> standby, point it to the IP address of the primary, and have it catch up.
> This is very important, IMHO. It's quite a lot of work to set up archiving
> and log-file shipping, for no obvious reason. It's really only needed at
> the moment because we're building this feature from spare parts.

I had similar suggestions when I first wrote the high-level design doc. From the wiki page:

- WALSender reads from WAL buffers and/or WAL files and sends the buffers to WALReceiver. In phase one, we may assume that WALSender can only read from WAL buffers and WAL files in the pg_xlog directory. Later on, this can be improved so that WALSender can temporarily restore archived files and read from those too.

I am not so sure whether we must support archive files or not, but I agree that at least supporting pg_xlog files will be necessary if we want to support seamless catch-up after a restart.

> For those reasons, we need a way to send arbitrary ranges of WAL from
> primary to standby. The current method, where the WAL is read from
> wal_buffers, obviously only works for very recent WAL pages that are still
> in wal_buffers. The design should be changed so that instead of reading
> from wal_buffers, the WAL is read from the filesystem.
>
> Sending directly from wal_buffers can be provided as a fastpath when
> sending a recent enough WAL range, but I wouldn't bother complicating the
> code for now.

How would that work for sync replication? Or are you suggesting that the WAL be first written to disk and then read back to be sent to the standby? I think reading from files is additional work in the sync path when we already have access to the WAL buffers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Hi,

On Wed, Dec 24, 2008 at 2:34 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> As the patch stands, whenever an XLOG segment is switched in XLogInsert, we
>> wait for the segment to be sent to the standby server. That's not good.
>> Particularly in asynchronous mode, you'd expect the standby to not have any
>> significant ill effect on the master. But in case of a flaky network
>> connection, or a busy or dead standby, it can take a long time for the
>> standby to respond, or for the primary to give up. During that time, all
>> WAL insertions on the primary are blocked. (How long is the default TCP
>> timeout again?)
>>
>> Another point is that in the future, we really shouldn't require setting up
>> archiving and file-based log shipping using external scripts when all you
>> want is replication. It should be enough to restore a base backup on the
>> standby, point it to the IP address of the primary, and have it catch up.
>> This is very important, IMHO. It's quite a lot of work to set up archiving
>> and log-file shipping, for no obvious reason. It's really only needed at
>> the moment because we're building this feature from spare parts.
>
> I had similar suggestions when I first wrote the high-level design doc.
> From the wiki page:
>
> - WALSender reads from WAL buffers and/or WAL files and sends the
> buffers to WALReceiver. In phase one, we may assume that WALSender can
> only read from WAL buffers and WAL files in the pg_xlog directory. Later
> on, this can be improved so that WALSender can temporarily restore
> archived files and read from those too.

Do you mean that a single walsender performs xlog streaming and copying from pg_xlog serially? I think that would degrade performance. And I'm worried about the situation where the speed of xlog generation on the primary is higher than the speed of copying it to the standby. We might never be able to start xlog streaming.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> And I'm worried about the situation where the speed of xlog generation
> on the primary is higher than the speed of copying it to the standby.
> We might never be able to start xlog streaming.

If that's the case, how do you expect the standby to keep pace with the primary after the initial sync-up? Frankly, I have serious doubts that on a relatively high-load setup the standby will be able to keep pace with the primary, for two reasons:

- Lack of read-ahead of data blocks (Suzuki-san's work may help with this)
- Single-threaded recovery

But then these are general problems which may impact any log-based replication.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Hi,

On Wed, Dec 24, 2008 at 5:48 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> And I'm worried about the situation where the speed of xlog generation
>> on the primary is higher than the speed of copying it to the standby.
>> We might never be able to start xlog streaming.
>
> If that's the case, how do you expect the standby to keep pace with
> the primary after the initial sync-up?

Good question. If streaming and copying are performed in parallel, that situation can't arise, because the speed of xlog generation then also depends on the streaming. This is a price to pay. I think the serial approach would need a "pace maker", and I don't know of a better pace maker than concurrent streaming.

> Frankly, I have serious doubts that on a relatively high-load setup the
> standby will be able to keep pace with the primary, for two reasons:
>
> - Lack of read-ahead of data blocks (Suzuki-san's work may help with this)
> - Single-threaded recovery
>
> But then these are general problems which may impact any log-based
> replication.

Right. Keeping up with a really high-load setup is probably impossible; there is certainly a price to pay. But in order to reduce that price as much as possible, I think we should not concentrate two or more operations in a single process (walsender), just as with single-threaded recovery.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
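As a back-of-envelope illustration of the catch-up condition being discussed (all numbers invented for the example): if the primary generates WAL at G MB/s, the standby can receive at S MB/s, and the initial backlog is B MB, then the backlog only drains if S > G, and it takes B / (S - G) seconds. With G = 5 MB/s, S = 8 MB/s and a 9 GB (9216 MB) backlog, catch-up takes 9216 / 3 = 3072 s, roughly 51 minutes. With S <= G it never completes, which is exactly the "never able to start streaming" scenario above; concurrent streaming avoids that by making G itself depend on S.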
On Wed, 2008-12-24 at 18:31 +0900, Fujii Masao wrote:
>> Frankly, I have serious doubts that on a relatively high-load setup the
>> standby will be able to keep pace with the primary, for two reasons:
>>
>> - Lack of read-ahead of data blocks (Suzuki-san's work may help with this)
>> - Single-threaded recovery
>>
>> But then these are general problems which may impact any log-based
>> replication.
>
> Right. Keeping up with a really high-load setup is probably impossible;
> there is certainly a price to pay. But in order to reduce that price as
> much as possible, I think we should not concentrate two or more
> operations in a single process (walsender), just as with single-threaded
> recovery.

I think we may be pleasantly surprised. In 8.3 there were two main sources of wait:

* restartpoints
* waiting for archive files

Restartpoints will now be handled by bgwriter, giving probably a 20% gain, plus the WAL data is streamed directly into memory by walreceiver. So I think the startup process may achieve a better steady state and perform very quickly.

Suzuki-san's numbers show that recovery with full_page_writes = on does not benefit significantly from read-ahead, and we already know full-page writes are effective in reducing the I/O bottleneck during recovery.

If we want to speed up recovery more, I think we'll see the need for an additional process to do WAL CRC checks.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
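To illustrate what such a helper could offload: each WAL record carries a CRC that the startup process currently verifies inline before redo. A minimal sketch of the per-record check, using zlib's crc32() as a stand-in for PostgreSQL's pg_crc32 macros, and a deliberately simplified record header (the real XLogRecord has more fields, and its CRC does not cover exactly this byte range):

    /* Sketch of the per-record CRC verification a helper process could
     * take over from the startup process.  zlib's crc32() stands in for
     * PostgreSQL's pg_crc32 macros; this header is a hypothetical
     * simplification of the real XLogRecord. */
    #include <stdint.h>
    #include <zlib.h>

    typedef struct
    {
        uint32_t xl_tot_len;    /* total record length, header included */
        uint32_t xl_crc;        /* CRC of everything after the header */
        /* payload follows immediately */
    } WalRecordHdr;

    int
    wal_record_crc_ok(const WalRecordHdr *rec)
    {
        const unsigned char *payload = (const unsigned char *) (rec + 1);
        uInt payload_len = rec->xl_tot_len - (uInt) sizeof(WalRecordHdr);
        uLong crc = crc32(0L, Z_NULL, 0);

        crc = crc32(crc, payload, payload_len);
        return (uint32_t) crc == rec->xl_crc;
    }

A checker process could run this over the stream a few records ahead of redo, so the startup process only ever touches records already known to be intact.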
On Wed, Dec 24, 2008 at 3:01 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Good question. If streaming and copying are performed in parallel, that
> situation can't arise, because the speed of xlog generation then also
> depends on the streaming. This is a price to pay. I think the serial
> approach would need a "pace maker", and I don't know of a better pace
> maker than concurrent streaming.

These operations need not even be parallel. My apologies if this has been discussed before, but what we are talking about is just a stream of WAL starting at some LSN. The only difference is that the LSN itself may be in the buffers or in the files. So walsender would send as much as it can from the files and then switch to reading from the buffers.

Also, I think you are underestimating the capacity of the network for most practical purposes. Networks are usually not the bottleneck unless we are talking about slow WAN setups, and I am not sure how common those are for PG users.

> Right. Keeping up with a really high-load setup is probably impossible.

If that's the case, I don't think you need to worry too much about the network or the walsender being a bottleneck for the initial sync-up (and note that we are only talking about WAL sync-up, not the base backup).

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> If we want to speed up recovery more, I think we'll see the need for an
> additional process to do WAL CRC checks.

Yeah, any such helper process, along with other optimizations, would certainly help. But I still can't believe that on a high-load, high-end setup, a single recovery process without any read-ahead for data blocks can keep pace with the WAL generated by hundreds of processes at the primary and shipped over a high-speed link to the standby.

BTW, on a completely different note: given that the entire recovery is based on physical redo, is there any inherent limitation that prevents parallel recovery, where different recovery processes apply redo to completely independent sets of data blocks?

I also sometimes wonder why we don't have block-level recovery when a single block in the database is corrupted. Can't this be done by just selectively applying WAL records to that particular block? If it's just because nobody has had the time or interest to do this, then that's OK, but I wonder if there are any design issues.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Wed, 2008-12-24 at 15:51 +0530, Pavan Deolasee wrote:
> On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> If we want to speed up recovery more, I think we'll see the need for an
>> additional process to do WAL CRC checks.
>
> Yeah, any such helper process, along with other optimizations, would
> certainly help. But I still can't believe that on a high-load, high-end
> setup, a single recovery process without any read-ahead for data blocks
> can keep pace with the WAL generated by hundreds of processes at the
> primary and shipped over a high-speed link to the standby.

Suzuki-san has provided measurements. I think we need more. With bgwriter performing restartpoints, we'll find that more RAM helps much more than it did previously.

> BTW, on a completely different note: given that the entire recovery is
> based on physical redo, is there any inherent limitation that prevents
> parallel recovery, where different recovery processes apply redo to
> completely independent sets of data blocks?

That's possible, but it would significantly complicate the recovery code. Retaining the ability to run standby queries would be almost impossible in that case, since you would need to parallelise the WAL stream without changing the commit order of transactions.

The main CPU bottleneck is CRC, by a long way. Moving that effort away from the startup process is the best next action, AFAICS.

> I also sometimes wonder why we don't have block-level recovery when a
> single block in the database is corrupted. Can't this be done by just
> selectively applying WAL records to that particular block?

You'll be able to do this with my rmgr patch. Selective application of WAL records is one of the primary use cases, but there are many others.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
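As a sketch of the selective-application idea (every name below is invented for illustration; the rmgr patch may expose something quite different):

    /* Hypothetical sketch of block-level recovery: walk the WAL and redo
     * only the records that touch one damaged block.  The record iterator
     * and redo entry point are invented names, not the rmgr patch's API. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t relfilenode; uint32_t blkno; } BlockId;

    extern bool wal_next_record(void **rec);                /* hypothetical */
    extern bool wal_record_touches(void *rec, BlockId b);   /* hypothetical */
    extern void wal_redo_record(void *rec);                 /* hypothetical */

    /* Starting from the last known-good base copy of the block, replay
     * every WAL record that modifies the target block. */
    void
    recover_single_block(BlockId target)
    {
        void *rec;

        while (wal_next_record(&rec))
        {
            if (wal_record_touches(rec, target))
                wal_redo_record(rec);
            /* All other records are skipped: physical redo against one
             * block does not depend on changes to unrelated blocks. */
        }
    }

That independence of physical redo per block is what makes both block-level repair and block-partitioned parallelism conceivable; the commit-order constraint Simon raises applies only when transaction visibility (standby queries) also has to be maintained.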
Fujii Masao wrote:
>> - WALSender reads from WAL buffers and/or WAL files and sends the
>> buffers to WALReceiver. In phase one, we may assume that WALSender can
>> only read from WAL buffers and WAL files in the pg_xlog directory. Later
>> on, this can be improved so that WALSender can temporarily restore
>> archived files and read from those too.
>
> Do you mean that a single walsender performs xlog streaming and copying
> from pg_xlog serially? I think that would degrade performance. And I'm
> worried about the situation where the speed of xlog generation on the
> primary is higher than the speed of copying it to the standby. We might
> never be able to start xlog streaming.

I've seen a few references to this. Somebody else mentioned how a single TCP/IP stream might not have the bandwidth to match changes to the database. TCP/IP streams have a window size that adjusts with the load, and unless one gets into aggressive networking such as BitTorrent, which arguably reduces performance of the entire network, why shouldn't one TCP/IP stream be enough? And if one TCP/IP stream isn't enough, doesn't this point to much larger problems that won't be solved by streaming it some other way over the network? As in: it doesn't matter what you do, your network pipe isn't big enough.

Over the Internet from my house to a co-located box, I can reliably get 1.1+ Mbyte/s using a single TCP/IP connection. The network connection at the co-lo is 10 Mbit/s, and my Internet connection at home is also 10 Mbit/s. One TCP/IP connection seems perfectly capable of streaming data to the full potential of the network.

Also, I assume that most database loads have peaks and lows. Especially for very large updates, perhaps end-of-day processing, I see it as a given that all of the standbys will fall "more behind" for a period (a few seconds to a minute?), but they will catch up shortly after the peak is over.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
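For what it's worth, the arithmetic supports this: 10 Mbit/s is 1.25 Mbyte/s, so a single connection moving 1.1 Mbyte/s is running at roughly 88% of line rate. The TCP window needed to keep a link full is just bandwidth times round-trip time; assuming, say, a 50 ms RTT (an illustrative figure, not a measurement), that is 1.25 MB/s x 0.05 s, about 64 KB, which even a classic unscaled TCP window can nearly cover and window scaling covers easily. A faster pipe or a longer RTT scales the requirement linearly, but it stays a window-sizing question, not a reason to open multiple streams.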