Thread: Proposal for 9.1: WAL streaming from WAL buffers
Hi,

In 9.0, walsender reads WAL always from the disk and sends it to the standby. That is, we cannot send WAL until it has been written (and flushed) to the disk. This degrades the performance of synchronous replication very much since a transaction commit must wait for the WAL write time *plus* the replication time.

The attached patch enables walsender to read data from WAL buffers in addition to the disk. Since we can write and send WAL simultaneously, in synchronous replication, a transaction commit has only to wait for either of them. So the performance would significantly increase.

Now three hackers (Zoltan, Simon and me) are planning to develop the synchronous replication feature. I'm not sure whose patch will be committed in the end. But since the attached patch provides just an infrastructure to optimize SR, it would work fine together with any of them and have a good effect.

I'll add the patch to the next CF. AFAIK the ReviewFest will start Jun 15. During that, if you are interested in the patch, please feel free to review it. You can also get the code change from my git repository:

    git://git.postgresql.org/git/users/fujii/postgres.git
    branch: read-wal-buffers

From here on I describe the details of the change.

At first, walsender reads WAL from the disk. If it has reached the current write location (i.e., there is no unsent WAL on the disk), it attempts to read from the WAL buffers. This buffer reading continues until the WAL to send has been purged from the WAL buffers. IOW, if the WAL buffers are large enough and walsender has been keeping up with WAL insertion, it can read WAL from the buffers indefinitely. If the WAL to send has been purged from the buffers, walsender backs off and tries to read it from the disk. If it finds no WAL to send on the disk, walsender attempts to read WAL from the buffers again. Walsender repeats these operations.

The location of the oldest record in the buffers is saved in shared memory. This location is used to determine whether a particular piece of WAL is still in the buffers or not. To avoid lock contention, walsender reads the WAL buffers and XLogCtl->xlblocks without holding either WALInsertLock or WALWriteLock. Of course, they might be changed by buffer replacement while being read. So after reading them, we check that what we read was valid by comparing the location of the read WAL with the location of the oldest record in the buffers. This logic is similar to what XLogRead() does at the end.

This feature is required to prevent the performance of synchronous replication from dropping significantly. It can also cut the time that a transaction committed on the master takes to become visible on the standby, so it's useful for asynchronous replication as well.

Thought? Comment? Objection?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
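For readers who want the shape of the read-then-validate idea without opening the patch, here is a minimal sketch against 9.0-era types. The two helper functions and the exact validation rule are assumptions for illustration only, not the patch's actual API.

/*
 * Minimal sketch of "copy from WAL buffers without locks, then validate".
 * XLogRecPtr and XLByteLE are spelled out so the fragment is self-contained;
 * in the real tree they come from access/xlogdefs.h.  The two extern helpers
 * are illustrative assumptions.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct XLogRecPtr
{
    uint32_t    xlogid;         /* log file #, 9.0-era two-part pointer */
    uint32_t    xrecoff;        /* byte offset within the log file */
} XLogRecPtr;

#define XLByteLE(a, b) \
    ((a).xlogid < (b).xlogid || \
     ((a).xlogid == (b).xlogid && (a).xrecoff <= (b).xrecoff))

extern const char *wal_buffer_address(XLogRecPtr ptr);        /* assumed */
extern XLogRecPtr read_shared_oldest_buffered_ptr(void);      /* assumed */

/*
 * Copy nbytes of WAL starting at startptr out of the shared WAL buffers
 * without holding WALInsertLock or WALWriteLock.  Returns false if the
 * range may have been recycled while we copied, in which case the caller
 * falls back to reading the WAL file on disk.
 */
static bool
read_from_wal_buffers(XLogRecPtr startptr, char *dst, size_t nbytes)
{
    XLogRecPtr  oldest;

    memcpy(dst, wal_buffer_address(startptr), nbytes);

    /*
     * Re-check the oldest-buffered-record pointer *after* the copy.  It only
     * ever advances, so if it is still at or before startptr, none of the
     * buffers we copied from can have been replaced underneath us.
     */
    oldest = read_shared_oldest_buffered_ptr();
    return XLByteLE(oldest, startptr);
}

The point of the post-copy check is that no lock is needed on the hot path: a stale copy is simply detected and retried from disk, as the proposal describes.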
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Thought? Comment? Objection? What happens if the WAL is streamed to the standby and then the master crashes without writing that WAL to disk? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Thought? Comment? Objection? > > What happens if the WAL is streamed to the standby and then the master > crashes without writing that WAL to disk? What are you concerned about? I think that the situation would be the same as 9.0 from users' perspective. After failover, the transaction which a client regards as aborted (because of the crash) might be visible or invisible on new master (i.e., original standby). For now, we cannot control that. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jun 11, 2010 at 9:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Thought? Comment? Objection? >> >> What happens if the WAL is streamed to the standby and then the master >> crashes without writing that WAL to disk? > > What are you concerned about? > > I think that the situation would be the same as 9.0 from users' perspective. > After failover, the transaction which a client regards as aborted (because > of the crash) might be visible or invisible on new master (i.e., original > standby). For now, we cannot control that. I think the failover case might be OK. But if the master crashes and restarts, the slave might be left thinking its xlog position is ahead of the xlog position on the master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Fujii Masao <masao.fujii@gmail.com> writes: > In 9.0, walsender reads WAL always from the disk and sends it to the standby. > That is, we cannot send WAL until it has been written (and flushed) to the disk. I believe the above statement to be incorrect: walsender does *not* wait for an fsync to occur. I agree with the idea of trying to read from WAL buffers instead of the file system, but the main reason why is that the current behavior makes FADVISE_DONTNEED for WAL pretty dubious. It'd be a good idea to still (artificially) limit replication to not read ahead of the written-out data. > ... Since we can write and send WAL simultaneously, in synchronous > replication, a transaction commit has only to wait for either of them. So the > performance would significantly increase. That performance claim, frankly, is ludicrous. There is no way that round trip network delay plus write+fsync on the slave is faster than local write+fsync. Furthermore, I would say that you are thinking exactly backwards about the requirements for synchronous replication: what that would mean is that transaction commit waits for *both*, not whichever one finishes first. regards, tom lane
On 06/11/2010 04:31 PM, Tom Lane wrote: > Fujii Masao<masao.fujii@gmail.com> writes: >> In 9.0, walsender reads WAL always from the disk and sends it to the standby. >> That is, we cannot send WAL until it has been written (and flushed) to the disk. > > I believe the above statement to be incorrect: walsender does *not* wait > for an fsync to occur. > > I agree with the idea of trying to read from WAL buffers instead of the > file system, but the main reason why is that the current behavior makes > FADVISE_DONTNEED for WAL pretty dubious. It'd be a good idea to still > (artificially) limit replication to not read ahead of the written-out > data. > >> ... Since we can write and send WAL simultaneously, in synchronous >> replication, a transaction commit has only to wait for either of them. So the >> performance would significantly increase. > > That performance claim, frankly, is ludicrous. There is no way that > round trip network delay plus write+fsync on the slave is faster than > local write+fsync. Furthermore, I would say that you are thinking > exactly backwards about the requirements for synchronous replication: > what that would mean is that transaction commit waits for *both*, > not whichever one finishes first. hmm not sure that is what fujii tried to say - I think his point was that in the original case we would have serialized all the operations (first write+sync on the master, network afterwards and write+sync on the slave) and now we could try parallelizing by sending the wal before we have synced locally. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > hmm not sure that is what fujii tried to say - I think his point was > that in the original case we would have serialized all the operations > (first write+sync on the master, network afterwards and write+sync on > the slave) and now we could try parallelizing by sending the wal before > we have synced locally. Well, we're already not waiting for fsync, which is the slowest part. If there's a performance problem, it may be because FADVISE_DONTNEED disables kernel buffering so that we're forced to actually read the data back from disk before sending it on down the wire. regards, tom lane
On 06/11/2010 04:47 PM, Tom Lane wrote:
> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc> writes:
>> hmm not sure that is what fujii tried to say - I think his point was
>> that in the original case we would have serialized all the operations
>> (first write+sync on the master, network afterwards and write+sync on
>> the slave) and now we could try parallelizing by sending the wal before
>> we have synced locally.
>
> Well, we're already not waiting for fsync, which is the slowest part.
> If there's a performance problem, it may be because FADVISE_DONTNEED
> disables kernel buffering so that we're forced to actually read the data
> back from disk before sending it on down the wire.

hmm ok - but assuming sync rep we would end up with something like the following (hypothetically assuming each operation takes 1 time unit):

originally:

write 1
sync 1
network 1
write 1
sync 1

total: 5

whereas in the new case we would basically have the write+sync compete with network+write+sync in parallel (total 3 units) and we would only have to wait for the slower of those two sets of operations instead of the total time of both, or am I missing something?

Stefan
> Well, we're already not waiting for fsync, which is the slowest part. > If there's a performance problem, it may be because FADVISE_DONTNEED > disables kernel buffering so that we're forced to actually read the data > back from disk before sending it on down the wire. Well, that's fairly direct to solve, no? Just disable FADVISE_DONTNEED if walsenders > 0. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Jun 11, 2010, at 16:31 , Tom Lane wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> In 9.0, walsender reads WAL always from the disk and sends it to the standby.
>> That is, we cannot send WAL until it has been written (and flushed) to the disk.
>
> I believe the above statement to be incorrect: walsender does *not* wait
> for an fsync to occur.

Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?

best regards,
Florian Pflug
> Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?

Well, in the failover case this isn't a problem, it's a benefit: the standby gets a transaction which you would have lost off the master.

However, I can see this as a problem in the event of a server-room powerout with very bad timing where there isn't a failover to the standby:

1) Master goes out
2) "floating" transaction applied to standby.
3) Standby goes out
4) Power back on
5) master comes up
6) standby comes up

It seems like, in that sequence, the standby would have one transaction which the master doesn't have, yet the standby thinks it can continue getting WAL from the master. Or did I miss something which makes this impossible?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Jun 12, 2010, at 3:10 , Josh Berkus wrote:
>> Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?
>
> 1) Master goes out
> 2) "floating" transaction applied to standby.
> 3) Standby goes out
> 4) Power back on
> 5) master comes up
> 6) standby comes up
>
> It seems like, in that sequence, the standby would have one transaction
> which the master doesn't have, yet the standby thinks it can continue
> getting WAL from the master. Or did I miss something which makes this
> impossible?

I did indeed miss something - with wal_sync_method set to either open_datasync or open_sync, all written WAL is also synced. Since open_datasync is the preferred setting according to http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD, systems supporting open_datasync should be safe.

My Ubuntu 10.04 box running postgres 8.4.4 doesn't support open_datasync though, and hence defaults to fdatasync. Probably because of this fragment in xlogdefs.h:

#if O_DSYNC != BARE_OPEN_SYNC_FLAG
#define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif

glibc defines O_DSYNC as an alias for O_SYNC and warrants that with

"Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."

If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.

best regards,
Florian Pflug
On 12/06/10 01:16, Josh Berkus wrote: > >> Well, we're already not waiting for fsync, which is the slowest part. >> If there's a performance problem, it may be because FADVISE_DONTNEED >> disables kernel buffering so that we're forced to actually read the data >> back from disk before sending it on down the wire. > > Well, that's fairly direct to solve, no? Just disable FADVISE_DONTNEED > if walsenders> 0. We already do that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Florian Pflug wrote:
> glibc defines O_DSYNC as an alias for O_SYNC and warrants that with
> "Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."
>
> If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.

It's not true, because Linux O_SYNC semantics are basically that it's never worked reliably on ext3. See http://archives.postgresql.org/pgsql-hackers/2007-10/msg01310.php for an example of how terrible the situation would be if O_SYNC were the default on Linux.

We just got a report that a better O_DSYNC is now properly exposed starting on kernel 2.6.33 + glibc 2.12: http://archives.postgresql.org/message-id/201006041539.03868.cousinmarc@gmail.com and it's possible they may have finally fixed it so it works like it's supposed to. PostgreSQL versions compiled against the right prerequisites will default to O_DSYNC by themselves. Whether or not this is a good thing has yet to be determined.

The last thing we'd want to do at this point is make the old and usually broken O_SYNC behavior suddenly preferred, when the new and possibly fixed O_DSYNC one will be automatically selected when available without any code changes on the database side.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
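The "default to O_DSYNC by themselves" behavior Greg mentions comes from a compile-time chain along these lines (a rough paraphrase of src/include/access/xlogdefs.h; the exact macro spellings may differ between versions):

/*
 * Rough paraphrase of the compile-time default for wal_sync_method.  When
 * the platform exposes a usable O_DSYNC (so OPEN_DATASYNC_FLAG gets defined,
 * as in the fragment quoted upthread), open_datasync wins; otherwise the
 * chain falls back to fdatasync and finally plain fsync.
 */
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_OPEN_DSYNC      /* open_datasync */
#elif defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FDATASYNC       /* fdatasync */
#else
#define DEFAULT_SYNC_METHOD     SYNC_METHOD_FSYNC           /* fsync */
#endif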
On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the failover case might be OK. But if the master crashes and > restarts, the slave might be left thinking its xlog position is ahead > of the xlog position on the master. Right. Unless we perform a failover in this case, the standby might go down because of inconsistency of WAL after restarting the master. To avoid this problem, walsender must wait for WAL to be not only written but also *fsynced* on the master before sending it as 9.0 does. Though this would degrade the performance, this might be useful for some cases. We should provide the knob to specify whether to allow the standby to go ahead of the master or not? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm not sure that is what fujii tried to say - I think his point was
>> that in the original case we would have serialized all the operations
>> (first write+sync on the master, network afterwards and write+sync on
>> the slave) and now we could try parallelizing by sending the wal before
>> we have synced locally.
>
> Well, we're already not waiting for fsync, which is the slowest part.

No, currently walsender waits for fsync.

Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. As the result, walsender cannot send WAL not fsynced yet. We should update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync for 9.0?

But that change would cause the problem that Robert pointed out.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

> If there's a performance problem, it may be because FADVISE_DONTNEED
> disables kernel buffering so that we're forced to actually read the data
> back from disk before sending it on down the wire.

Currently, if max_wal_senders > 0, POSIX_FADV_DONTNEED is not used for WAL files at all.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
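For reference, the guard Fujii describes looks roughly like this in the 9.0 tree when a WAL segment is closed (paraphrased; XLogIsNeeded() is true when WAL archiving is enabled or max_wal_senders > 0):

/*
 * Paraphrase of the cache hint applied when closing a WAL segment.  The
 * DONTNEED advice is skipped whenever the segment may be re-read locally,
 * e.g. by the archiver or, with max_wal_senders > 0, by a walsender.
 */
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
    if (!XLogIsNeeded())
        (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif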
On Sat, Jun 12, 2010 at 12:15 AM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> hmm ok - but assuming sync rep we would end up with something like the
> following (hypothetically assuming each operation takes 1 time unit):
>
> originally:
>
> write 1
> sync 1
> network 1
> write 1
> sync 1
>
> total: 5
>
> whereas in the new case we would basically have the write+sync compete with
> network+write+sync in parallel (total 3 units) and we would only have to wait
> for the slower of those two sets of operations instead of the total time of
> both, or am I missing something?

Yeah, this is what I'd like to say. Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jun 14, 2010 at 4:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think the failover case might be OK. But if the master crashes and >> restarts, the slave might be left thinking its xlog position is ahead >> of the xlog position on the master. > > Right. Unless we perform a failover in this case, the standby might go down > because of inconsistency of WAL after restarting the master. To avoid this > problem, walsender must wait for WAL to be not only written but also *fsynced* > on the master before sending it as 9.0 does. Though this would degrade the > performance, this might be useful for some cases. We should provide the knob > to specify whether to allow the standby to go ahead of the master or not? Maybe. That sounds like a pretty enormous foot-gun to me, considering that we have no way of recovering from the situation where the standby gets ahead of the master. Right now, I believe we're still in the situation where the standby goes into an infinite CPU-chewing, log-spewing loop, but even after we fix that it's not going to be good enough to really handle that case sensibly, which we probably need to do if we want to make this change. Come to think of it, can this happen already? Can the master stream WAL to the standby after it's written but before it's fsync'd? We should get the open item fixed for 9.0 here before we start worrying about 9.1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> hmm not sure that is what fujii tried to say - I think his point was
>>> that in the original case we would have serialized all the operations
>>> (first write+sync on the master, network afterwards and write+sync on
>>> the slave) and now we could try parallelizing by sending the wal before
>>> we have synced locally.
>>
>> Well, we're already not waiting for fsync, which is the slowest part.
>
> No, currently walsender waits for fsync.
>
> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
> As the result, walsender cannot send WAL not fsynced yet. We should
> update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync
> for 9.0?
>
> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

ISTM you just defined some clear objectives for next work.

Copying the data from WAL buffers is mostly irrelevant. The majority of time is lost waiting for fsync. The biggest issue is about how to allow WAL write and WALSender to act concurrently and have backend wait for both.

Sure, copying data from wal_buffers will be faster still, but it will cause you to address some subtle data structure locking operations that we could solve at a later time. And it still gives the problem of how the master resets itself if the standby really is ahead.

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
> No, currently walsender waits for fsync.
> ...
> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Presumably this means that if synchronous_commit = off on primary that SR in 9.0 will no longer work correctly if the primary crashes?

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Maybe. That sounds like a pretty enormous foot-gun to me, considering > that we have no way of recovering from the situation where the standby > gets ahead of the master. No, we can do that by reconstructing the standby from the backup. And, that situation is not a problem for users including me who prefer to perform a failover when the master goes down. Of course, we can just restart the master in that case, but it's likely to take longer than a failover because there would be a cause of the crash. For example, if the master goes down because of a media crash, the master would never start up unless PITR is performed. So I'm not sure how many users prefer a restart to a failover. > We should get the open item fixed for 9.0 here before we start > worrying about 9.1. Yep, so I was submitting some patches in these days :) Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Maybe. That sounds like a pretty enormous foot-gun to me, considering >> that we have no way of recovering from the situation where the standby >> gets ahead of the master. > > No, we can do that by reconstructing the standby from the backup. > > And, that situation is not a problem for users including me who prefer to > perform a failover when the master goes down. You don't get to pick - if a backend crashes on the master, it will restart right away and come up, but the slave will now be hosed... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, we're already not waiting for fsync, which is the slowest part.
>
> No, currently walsender waits for fsync.

No, you're mistaken.

> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.

Wrong. LogwrtResult.Write tracks how far we've written out data, but it is only (known to be) fsync'd as far as LogwrtResult.Flush.

> But that change would cause the problem that Robert pointed out.
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Yes. Possibly walsender should only be allowed to send as far as LogwrtResult.Flush.

			regards, tom lane
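A minimal sketch of the change Tom is suggesting, assuming an accessor (called GetFlushRecPtr() here) that returns the shared LogwrtResult.Flush value; the function and variable names are assumptions for illustration rather than a committed API:

/*
 * Sketch: cap what walsender may ship at the flush pointer instead of the
 * write pointer, so only WAL that is safely on the master's disk ever
 * leaves the master.  Types and macros as in access/xlogdefs.h.
 */
static XLogRecPtr sentPtr;              /* how far we have already sent */

static void
XLogSend_sketch(void)
{
    XLogRecPtr  SendRqstPtr;

    SendRqstPtr = GetFlushRecPtr();     /* previously: the write pointer */

    /* Nothing new that is safe to send yet? */
    if (XLByteLE(SendRqstPtr, sentPtr))
        return;

    /* ... read [sentPtr, SendRqstPtr) from pg_xlog and send it ... */
    sentPtr = SendRqstPtr;
}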
On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Maybe. That sounds like a pretty enormous foot-gun to me, considering
>>> that we have no way of recovering from the situation where the standby
>>> gets ahead of the master.
>>
>> No, we can do that by reconstructing the standby from the backup.
>>
>> And, that situation is not a problem for users including me who prefer to
>> perform a failover when the master goes down.
>
> You don't get to pick - if a backend crashes on the master, it will
> restart right away and come up, but the slave will now be hosed...

You are concerned about the case where postmaster automatically restarts the crash recovery, in particular? Yes, this case is more problematic. If the standby is ahead of the master, the standby might find an invalid record and run into the infinite retry loop, or keep working without noticing the inconsistency between the database and the WAL.

I'm thinking that walreceiver should throw a PANIC when it receives the record which is in the LSN older than the last WAL receive location, except the beginning of streaming (because the standby always requests for streaming from the starting of WAL file at first even if some records have already been received in previous time). Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
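A sketch of the sanity check being proposed here, assuming walreceiver tracks the end of what it has already received in a variable called receivedUpto and can tell the first message of a (re)connection apart; all names here are illustrative:

/*
 * Sketch of the proposed walreceiver check: once streaming is established,
 * the master must never hand us WAL that starts before what we have already
 * received.  The first message after (re)connecting is excused, because the
 * standby itself asks to restart from the current segment boundary.
 */
static void
check_received_lsn_sketch(XLogRecPtr dataStart, bool first_message)
{
    if (!first_message && XLByteLT(dataStart, receivedUpto))
        ereport(PANIC,
                (errmsg("received WAL starting at %X/%X, "
                        "behind the last received position %X/%X",
                        dataStart.xlogid, dataStart.xrecoff,
                        receivedUpto.xlogid, receivedUpto.xrecoff)));
}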
On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Well, we're already not waiting for fsync, which is the slowest part. > >> No, currently walsender waits for fsync. > > No, you're mistaken. > >> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, >> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. > > Wrong. LogwrtResult.Write tracks how far we've written out data, > but it is only (known to be) fsync'd as far as LogwrtResult.Flush. Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position we've written. But in the current XLogWrite() code, it's updated after XLogWrite() calls issue_xlog_fsync(). No? Of course, the backend-local LogwrtResult.Write is updated before issue_xlog_fsync(), but it's not available by walsender. Am I missing something? >> But that change would cause the problem that Robert pointed out. >> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php > > Yes. Possibly walsender should only be allowed to send as far as > LogwrtResult.Flush. Yes, in order to avoid that problem, walsender should wait for WAL to be fsync'd before sending it. But I'm worried that this would slow down the performance on the master significantly because WAL flush and WAL streaming are not performed concurrently and the backend must wait for both in a serial manner. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 15/06/10 07:47, Fujii Masao wrote: > On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Fujii Masao<masao.fujii@gmail.com> writes: >>> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH, >>> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync. >> >> Wrong. LogwrtResult.Write tracks how far we've written out data, >> but it is only (known to be) fsync'd as far as LogwrtResult.Flush. > > Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position > we've written. But in the current XLogWrite() code, it's updated after > XLogWrite() calls issue_xlog_fsync(). No? issue_xlog_fsync() is only called if the caller requested a flush by advancing WriteRqst.Flush. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
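To make the point concrete, the relevant shape of XLogWrite() is roughly the following (heavily abridged; the shared-memory publication of both pointers happens at the end, and the fsync branch only fires when the caller's request includes a flush):

/*
 * Heavily abridged shape of XLogWrite().  The write pointer advances as
 * pages go out; issue_xlog_fsync() runs only when a flush was requested,
 * and only then does the Flush pointer advance.  A walsender limited to
 * the shared Flush pointer therefore never sees unflushed WAL.
 */
static void
XLogWrite_sketch(XLogwrtRqst WriteRqst)
{
    /* ... write out completed WAL pages up to WriteRqst.Write ... */
    LogwrtResult.Write = WriteRqst.Write;

    if (XLByteLT(LogwrtResult.Flush, WriteRqst.Flush) &&
        XLByteLT(LogwrtResult.Flush, LogwrtResult.Write))
    {
        issue_xlog_fsync();             /* arguments elided in this sketch */
        LogwrtResult.Flush = LogwrtResult.Write;
    }

    /* Publish both pointers for other backends (and walsender) to see. */
    SpinLockAcquire(&xlogctl->info_lck);
    xlogctl->LogwrtResult = LogwrtResult;
    SpinLockRelease(&xlogctl->info_lck);
}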
On Tue, Jun 15, 2010 at 2:16 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 15/06/10 07:47, Fujii Masao wrote:
>> On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>>> Fujii Masao<masao.fujii@gmail.com> writes:
>>>> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
>>>> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
>>>
>>> Wrong. LogwrtResult.Write tracks how far we've written out data,
>>> but it is only (known to be) fsync'd as far as LogwrtResult.Flush.
>>
>> Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
>> we've written. But in the current XLogWrite() code, it's updated after
>> XLogWrite() calls issue_xlog_fsync(). No?
>
> issue_xlog_fsync() is only called if the caller requested a flush by
> advancing WriteRqst.Flush.

True. The scenario that I'm concerned about is:

1. A transaction commit causes XLogFlush() to write *and* fsync WAL up to the commit record.
2. XLogFlush() calls XLogWrite(), and xlogctl->LogwrtResult.Write is updated to indicate the LSN bigger than or equal to that of the commit record after XLogWrite() calls issue_xlog_fsync().
3. Then walsender can send WAL up to the commit record.

A transaction commit would need to wait for local fsync and replication in a serial manner, in synchronous replication. IOW, walsender cannot send the commit record until it's fsync'd in XLogWrite(). This scenario will not happen? Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Jun 15, 2010 at 12:46 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Maybe. That sounds like a pretty enormous foot-gun to me, considering >>>> that we have no way of recovering from the situation where the standby >>>> gets ahead of the master. >>> >>> No, we can do that by reconstructing the standby from the backup. >>> >>> And, that situation is not a problem for users including me who prefer to >>> perform a failover when the master goes down. >> >> You don't get to pick - if a backend crashes on the master, it will >> restart right away and come up, but the slave will now be hosed... > > You are concerned about the case where postmaster automatically restarts > the crash recovery, in particular? Yes, this case is more problematic. > If the standby is ahead of the master, the standby might find an invalid > record and run into the infinite retry loop, or keep working without > noticing the inconsistency between the database and the WAL. > > I'm thinking that walreceiver should throw a PANIC when it receives the > record which is in the LSN older than the last WAL receive location, > except the beginning of streaming (because the standby always requests > for streaming from the starting of WAL file at first even if some records > have already been received in previous time). Thought? Yeah, that seems like it would be a good safety check. I wonder if it would be possible to jigger things so that we send the WAL to the standby as soon as it is generated, but somehow arrange things so that the standby knows the last location that the master has fsync'd and never applies beyond that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Jun 15, 2010, at 10:45 , Fujii Masao wrote:
> A transaction commit would need to wait for local fsync and replication
> in a serial manner, in synchronous replication. IOW, walsender cannot
> send the commit record until it's fsync'd in XLogWrite().

Hm, but since 9.0 won't do synchronous replication anyway, the right thing to do for 9.0 is still to send only fsync'ed WAL, no? Without synchronous replication the overhead seems negligible.

For synchronous replication (and hence for 9.1) I think there are two basic options:

a) Stream only fsync'ed WAL, like in the asynchronous case. Depending on policy, additionally wait for one or more slaves to fsync before reporting success.

b) Stream non-fsync'ed WAL. On COMMIT, wait for at least one node (not necessarily the master, exact count depends on policy) to fsync before reporting success. During recovery of the master, recover up to the latest LSN found on any one of the nodes.

Option (b) requires some additional thought, though. Controlled removal of slave nodes and concurrent crashes of more than one node are the most difficult areas to handle gracefully, it seems.

best regards,
Florian Pflug
> I wonder if it would be possible to jigger things so that we send the > WAL to the standby as soon as it is generated, but somehow arrange > things so that the standby knows the last location that the master has > fsync'd and never applies beyond that point. I can't think of any way which would not require major engineering. And you'd be slowing down replication *in general* to deal with a fairly unlikely corner case. I think the panic is the way to go. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I wonder if it would be possible to jigger things so that we send the >> WAL to the standby as soon as it is generated, but somehow arrange >> things so that the standby knows the last location that the master has >> fsync'd and never applies beyond that point. > > I can't think of any way which would not require major engineering. And > you'd be slowing down replication *in general* to deal with a fairly > unlikely corner case. > > I think the panic is the way to go. I have yet to convince myself of how likely this is to occur. I tried to reproduce this issue by crashing the database, but I think in 9.0 you need an actual operating system crash to cause this problem, and I haven't yet set up an environment in which I can repeatedly crash the OS. I believe, though, that in 9.1, we're going to want to stream from WAL buffers as proposed in the patch that started out this thread, and then I think this issue can be triggered with just a database crash. In 9.0, I think we can fix this problem by (1) only streaming WAL that has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But in 9.1, with sync rep and the performance demands that entails, I think that we're going to need to rethink it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.

Yes, but it still requires:

a) the master must crash with at least one transaction transmitted to the slave and not yet fsync'd
b) the slave must not crash as well
c) the master must come back up without the slave ever having been promoted to master

Note that (a) is fairly improbable to begin with due to both our batching transactions into bundles for transmission, and network latency vs. disk latency.

So, is it possible? Yes. Will it happen anywhere but the highest-txn-rate sites one in 10,000 times? No. This means that we should look for a solution which does not penalize the common case in order to close a very improbable hole, if such a solution exists.

> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and

I don't think this is the best solution; it would be a noticeable performance penalty on replication. It also would potentially result in data loss for the user; if the user fails over to the slave in the corner case, they can "rescue" the in-flight transaction. At the least, this would need to become Yet Another Configuration Option.

> (2) PANIC-ing if the problem occurs anyway.

The question is, is detecting out-of-order WAL records *sufficient* to detect a failure? I'm thinking there are possible sequences where there would be no out-of-sequence, but the slave would still have a transaction the master doesn't, which the user wouldn't know about until a page update corrupts their data.

> But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

All the more reason to avoid dealing with it now, if we can.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On 6/15/10 5:09 PM, Josh Berkus wrote: >> > In 9.0, I think we can fix this problem by (1) only streaming WAL that >> > has been fsync'd and > > I don't think this is the best solution; it would be a noticeable > performance penalty on replication. Actually, there's an even bigger reason not to mandate waiting for fsync: what if the user turns fsync off? One can certainly imagine users choosing to rely on their replication slaves for crash recovery instead of fsync. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Jun 15, 2010 at 8:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> I have yet to convince myself of how likely this is to occur. I tried
>> to reproduce this issue by crashing the database, but I think in 9.0
>> you need an actual operating system crash to cause this problem, and I
>> haven't yet set up an environment in which I can repeatedly crash the
>> OS. I believe, though, that in 9.1, we're going to want to stream
>> from WAL buffers as proposed in the patch that started out this
>> thread, and then I think this issue can be triggered with just a
>> database crash.
>
> Yes, but it still requires:
>
> a) the master must crash with at least one transaction transmitted to
> the slave and not yet fsync'd

Bzzzzt. Stop right there. It only requires the master to crash with at least one *WAL record* written but not transmitted, not one transaction. And most WAL record types are not fsync'd immediately. So in theory I think that, for example, an OS crash in the middle of a big bulk insert operation should be sufficient to trigger this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> I wonder if it would be possible to jigger things so that we send the
>>> WAL to the standby as soon as it is generated, but somehow arrange
>>> things so that the standby knows the last location that the master has
>>> fsync'd and never applies beyond that point.
>>
>> I can't think of any way which would not require major engineering. And
>> you'd be slowing down replication *in general* to deal with a fairly
>> unlikely corner case.
>>
>> I think the panic is the way to go.
>
> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.
>
> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

The problem is not that the master streams non-fsync'd WAL, but that the standby can replay that. So I'm thinking that we can send non-fsync'd WAL safely if the standby makes the recovery wait until the master has fsync'd WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush location to walreceiver, and the standby applies only the WAL which the master has already fsync'd. Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
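A sketch of the apply-side guard being proposed, assuming the master's flush position is shipped along with the WAL and published in shared memory behind an accessor called GetMasterFlushRecPtr() here; every name in this fragment is an assumption for discussion, not an existing API:

/*
 * Sketch: the startup process refuses to replay a record whose end lies
 * beyond what the master has reported as fsync'd, even though walreceiver
 * may already have streamed (and locally written) WAL past that point.
 */
static bool
record_is_safe_to_apply(XLogRecPtr EndRecPtr)
{
    XLogRecPtr  masterFlushed = GetMasterFlushRecPtr();

    /*
     * Apply only WAL that the master already has safely on disk; otherwise
     * wait for walreceiver to report a newer master flush location.
     */
    return XLByteLE(EndRecPtr, masterFlushed);
}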
On 21/06/10 12:08, Fujii Masao wrote: > On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas<robertmhaas@gmail.com> wrote: >> In 9.0, I think we can fix this problem by (1) only streaming WAL that >> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But >> in 9.1, with sync rep and the performance demands that entails, I >> think that we're going to need to rethink it. > > The problem is not that the master streams non-fsync'd WAL, but that the > standby can replay that. So I'm thinking that we can send non-fsync'd WAL > safely if the standby makes the recovery wait until the master has fsync'd > WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush > location to walreceiver, and the standby applies only the WAL which the > master has already fsync'd. Thought? I guess, but you have to be very careful to correctly refrain from applying the WAL. For example, a naive implementation might write the WAL to disk in walreceiver immediately, but refrain from telling the startup process about it. If walreceiver is then killed because the connection is broken (and it will be because the master just crashed), the startup process will read the streamed WAL from the file in pg_xlog, and go ahead to apply it anyway. So maybe there's some room for optimization there, but given the round-trip required for the acknowledgment anyway it might not buy you much, and the implementation is not very straightforward. This is clearly 9.1 material, if worth optimizing at all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 21, 2010 at 10:40 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I guess, but you have to be very careful to correctly refrain from applying
> the WAL. For example, a naive implementation might write the WAL to disk in
> walreceiver immediately, but refrain from telling the startup process about
> it. If walreceiver is then killed because the connection is broken (and it
> will be because the master just crashed), the startup process will read the
> streamed WAL from the file in pg_xlog, and go ahead to apply it anyway.

So the goal is that when you *do* failover to the standby it replays these additional records. So whether the startup process obeys this limit would have to be conditional on whether it's still in standby mode.

> So maybe there's some room for optimization there, but given the round-trip
> required for the acknowledgment anyway it might not buy you much, and the
> implementation is not very straightforward. This is clearly 9.1 material, if
> worth optimizing at all.

I don't see any need for a round-trip acknowledgement -- no more than currently. The master just includes the flush location in every response. It might have to send additional responses when fsyncs happen, though, to update the flush location even if no additional records are sent. Otherwise a hot standby might spend a long time with out-dated data even if on failover it would be up to date, which seems non-ideal for hot standby users.

I think this would be a good improvement for databases processing large batch updates, so the standby doesn't have an increased risk of losing a large amount of data if there's a crash after processing such a large query. I agree it's 9.1 material.

Earlier we made a change to the WAL streaming protocol on the basis that we wanted to get the protocol right even if we don't use the change right away. I'm not sure I understand that -- it's not like we're going to stream WAL from 9.0 to 9.1. But if that was true then perhaps we need to add the WAL flush location to the protocol now even if we're not going to use it yet?

--
greg
On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
> The problem is not that the master streams non-fsync'd WAL, but that the
> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
> safely if the standby makes the recovery wait until the master has fsync'd
> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
> location to walreceiver, and the standby applies only the WAL which the
> master has already fsync'd. Thought?

Yes, good thought. The patch just applied seems too much.

I had the same thought, though it would mean you'd need to send two xlog end locations, one for write, one for fsync. Though not really clear why we send the "current end of WAL on the server" anyway, so maybe we can just alter that.

--
Simon Riggs  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
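To illustrate carrying two end-of-WAL locations, here is a hypothetical variant of the per-message header walsender already prepends to WAL data in 9.0 (WalDataMessageHeader in the real tree); the extra field and its name are assumptions for discussion only, not a committed wire format:

/*
 * Hypothetical variant of the WAL-data message header, extended to carry
 * both the write and the flush end positions so the standby can buffer
 * everything it receives but replay only up to walFlushEnd.
 */
typedef struct
{
    XLogRecPtr  dataStart;      /* WAL start location of the data that follows */
    XLogRecPtr  walWriteEnd;    /* how far the master has written WAL */
    XLogRecPtr  walFlushEnd;    /* how far the master has fsync'd WAL */
    TimestampTz sendTime;       /* master's clock at the time of transmission */
} WalDataMessageHeaderSketch;

Keeping both fields in every data message avoids any extra round trip: the standby learns the master's flush progress as a side effect of receiving WAL, which is the point Greg makes above about not needing additional acknowledgments.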
Simon Riggs wrote: > On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote: > > > The problem is not that the master streams non-fsync'd WAL, but that the > > standby can replay that. So I'm thinking that we can send non-fsync'd WAL > > safely if the standby makes the recovery wait until the master has fsync'd > > WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush > > location to walreceiver, and the standby applies only the WAL which the > > master has already fsync'd. Thought? > > Yes, good thought. The patch just applied seems too much. > > I had the same thought, though it would mean you'd need to send two xlog > end locations, one for write, one for fsync. Though not really clear why > we send the "current end of WAL on the server" anyway, so maybe we can > just alter that. Is this a TODO? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + None of us is going to be here forever. +
On Tue, Jun 29, 2010 at 10:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Simon Riggs wrote:
>> On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
>>
>>> The problem is not that the master streams non-fsync'd WAL, but that the
>>> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
>>> safely if the standby makes the recovery wait until the master has fsync'd
>>> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
>>> location to walreceiver, and the standby applies only the WAL which the
>>> master has already fsync'd. Thought?
>>
>> Yes, good thought. The patch just applied seems too much.
>>
>> I had the same thought, though it would mean you'd need to send two xlog
>> end locations, one for write, one for fsync. Though not really clear why
>> we send the "current end of WAL on the server" anyway, so maybe we can
>> just alter that.
>
> Is this a TODO?

Maybe. As Heikki pointed out upthread, the standby can't even write the WAL out to the OS until it's been fsync'd on the master without risking the problem under discussion. So we can stream the WAL from master to standby as long as the standby just buffers it in memory (or somewhere other than the usual location in pg_xlog).

Before we get too busy frobnicating this gonkulator, I'd like to see a little more discussion of what kind of performance people are expecting from sync rep. Sounds to me like the best we can expect here is, on every commit: (a) wait for master fsync to complete, (b) send message to standby, (c) wait for a reply from standby indicating that fsync is complete on standby. Even assuming that the network overhead is minimal, that halves the commit rate. Are the people who want sync rep OK with that? Is there any way to do better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 30, 2010 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe. As Heikki pointed out upthread, the standby can't even write
> the WAL out to the OS until it's been fsync'd on the master without
> risking the problem under discussion.

If we change the startup process so that it doesn't go ahead of the master's fsync location even after the walreceiver is terminated, we would have no need to worry about that risk. For further robustness, the walreceiver might be able to zero the WAL records which have not been fsync'd on the master yet, when being terminated.

But, if the standby crashes after the master crashes, restart of the standby might replay that non-fsync'd WAL wrongly because it cannot remember the master's fsync location. In this case, if we promote the standby to the master, we still don't have to worry about that risk. But instead of performing a failover, if we restart the master and make the standby connect to the master again, the database on the standby would get corrupted. For now, I don't have a good idea to avoid that database corruption caused by the double failure (crash of both master and standby)...

> So we can stream the
> WAL from master to standby as long as the standby just buffers it in
> memory (or somewhere other than the usual location in pg_xlog).

Yeah, I was just thinking the same thing. But the problem is that the buffer size might become too big (might be bigger than 16MB). For example, synchronous_commit = off and wal_writer_delay = 10000ms on the master would delay the fsync significantly and increase the buffer size on the standby.

> Before we get too busy frobnicating this gonkulator, I'd like to see a
> little more discussion of what kind of performance people are
> expecting from sync rep. Sounds to me like the best we can expect
> here is, on every commit: (a) wait for master fsync to complete, (b)
> send message to standby, (c) wait for a reply from standby indicating
> that fsync is complete on standby. Even assuming that the network
> overhead is minimal, that halves the commit rate. Are the people who
> want sync rep OK with that? Is there any way to do better?

(c) would depend on the synchronization mode the user chooses:

#1 Wait for WAL to be received by the standby
#2 Wait for WAL to be received and flushed by the standby
#3 Wait for WAL to be received, flushed and replayed by the standby

(a) would depend on synchronous_commit. Personally I'm interested in disabling synchronous_commit on the master and choosing #1 as the sync mode. Though this may be very optimistic configuration :)

The point for performance of sync rep is to parallelize (a) and (b)+(c), I think. If they are performed in a serial manner, the performance overhead on the master would become high.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
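For concreteness, the three acknowledgement levels listed above could be written down as something like the following; the enum and its names are purely illustrative (nothing like it exists in 9.0):

/*
 * Illustrative only: the three standby acknowledgement levels under
 * discussion.  A waiting commit on the master would be released once the
 * configured level has been reached on the required standby(s); the lower
 * the level, the cheaper the commit and the weaker the guarantee.
 */
typedef enum StandbyAckLevel
{
    ACK_RECEIVED,               /* #1: WAL received by the standby             */
    ACK_FSYNCED,                /* #2: ... and flushed to the standby's disk   */
    ACK_APPLIED                 /* #3: ... and replayed by the startup process */
} StandbyAckLevel;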
On Wed, Jun 30, 2010 at 5:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Before we get too busy frobnicating this gonkulator, I'd like to see a
>> little more discussion of what kind of performance people are
>> expecting from sync rep. Sounds to me like the best we can expect
>> here is, on every commit: (a) wait for master fsync to complete, (b)
>> send message to standby, (c) wait for a reply from standby indicating
>> that fsync is complete on standby. Even assuming that the network
>> overhead is minimal, that halves the commit rate. Are the people who
>> want sync rep OK with that? Is there any way to do better?
>
> (c) would depend on the synchronization mode the user chooses:
>
> #1 Wait for WAL to be received by the standby
> #2 Wait for WAL to be received and flushed by the standby
> #3 Wait for WAL to be received, flushed and replayed by the standby
>
> (a) would depend on synchronous_commit. Personally I'm interested in
> disabling synchronous_commit on the master and choosing #1 as the sync
> mode. Though this may be very optimistic configuration :)
>
> The point for performance of sync rep is to parallelize (a) and (b)+(c),
> I think. If they are performed in a serial manner, the performance
> overhead on the master would become high.

Right. So we need to try to come up with a design that permits that, which must be robust in the face of any number of crashes on the two machines, in any order. Until we have that, we're just going around in circles.

One thought that occurred to me is that if the master and standby were more tightly coupled, you could recover after a crash by making the one with the further-advanced WAL position the master, and the other one the standby. That would get around this problem, though at the cost of considerable additional complexity. But then if one of the servers comes up and can't talk to the other, you need some mechanism for preventing split-brain syndrome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Jun 30, 2010 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > One thought that occurred to me is that if the master and standby were > more tightly coupled, you could recover after a crash by making the > one with the further-advanced WAL position the master, and the other > one the standby. That would get around this problem, though at the > cost of considerable additional complexity. But then if one of the > servers comes up and can't talk to the other, you need some mechanism > for preventing split-brain syndrome. Users should be free to build infrastructure to allow that. But we can't just switch ourselves -- we don't know what other pieces of their systems need to be updated when the master changes. We also need to stop thinking in terms of one master and one slave. They could have dozens of slaves and in case of failover would want to pick the slave with the most recent WAL position. The way I picture that happening they're monitoring all their slaves in some monitoring tool and use that data to pick the new master. Some external tool picks the new master and tells that host, all the other slaves, and all the rest of the their infrastructure where to find the new master and does whatever is necessary to restart or reload configurations. The question I think is what interfaces do we need in Postgres to make this easy. The monitoring tool needs a way to find the current WAL position from the slaves even when the master is down. That means potentially needing to start up the slaves in read-only mode with no master at all. It also means making it easy for an external tool to switch a node from slave to primary and change a slave's master. And it also means a slave should be able to change master and pick up where it left off easily. I'm not sure what the recommended interfaces for these operations would be currently for an external tool. -- greg
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> In 9.0, walsender reads WAL always from the disk and sends it to the standby.
> That is, we cannot send WAL until it has been written (and flushed) to the disk.
> This degrades the performance of synchronous replication very much since a
> transaction commit must wait for the WAL write time *plus* the replication time.
>
> The attached patch enables walsender to read data from WAL buffers in addition
> to the disk. Since we can write and send WAL simultaneously, in synchronous
> replication, a transaction commit has only to wait for either of them. So the
> performance would significantly increase.

To recap the previous discussion on this thread, we ended up changing the behavior of 9.0 so that it only sends WAL which has been written to the OS *and flushed*, because sending unflushed WAL to the standby is unsafe. The standby can get ahead of the master while still believing that the databases are in sync, due to the fact that after an SR reconnect we rewind to the start of the current WAL segment. This results in a silently corrupt standby database.

If it's unsafe to send written but unflushed WAL to the standby, then for the same reasons we can't send unwritten WAL either. Therefore, I believe that this entire patch in its current form is a nonstarter and we should mark it Rejected in the CF app so that reviewers don't unnecessarily spend time on it.

Having said that, I do think we urgently need some high-level design discussion on how sync rep is actually going to handle this issue (perhaps on a new thread). If we can't resolve this issue, sync rep is going to be really slow; but there are no easy solutions to this problem in sight, so if we want to have sync rep for 9.1 we'd better agree on one of the difficult solutions soon so that work can begin.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> If it's unsafe to send written but unflushed WAL to the standby, then
> for the same reasons we can't send unwritten WAL either.
[...]
> Having said that, I do think we urgently need some high-level design
> discussion on how sync rep is actually going to handle this issue

Stop me if I'm all wrong already, but I thought we said that we should handle this case by decoupling what we can send to the standby and what it can apply. We could do this by sending the current WAL fsync'ed position on the master in the WAL sender protocol, either in the WAL itself or as out-of-band messages, I guess.

Now, this can be made safe, how to make it fast (low-latency) is yet to be addressed.

Regards,
--
dim
On Wed, Jul 7, 2010 at 4:40 AM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Stop me if I'm all wrong already, but I thought we said that we should
> handle this case by decoupling what we can send to the standby and what
> it can apply. We could do this by sending the current WAL fsync'ed
> position on the master in the WAL sender protocol, either in the WAL
> itself or as out-of-band messages, I guess.
>
> Now, this can be made safe, how to make it fast (low-latency) is yet to
> be addressed.

Yeah, that's the trick, isn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Stop me if I'm all wrong already, but I thought we said that we should
> handle this case by decoupling what we can send to the standby and what
> it can apply.

What's the point of that? It won't make the standby apply any faster. What it will do is make the protocol more complicated, hence slower (more messages) and more at risk of bugs.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Stop me if I'm all wrong already, but I thought we said that we should
>> handle this case by decoupling what we can send to the standby and what
>> it can apply.
>
> What's the point of that? It won't make the standby apply any faster.

True, but it allows sending the WAL content before acking its fsync.

Regards.
--
dim
On 7/6/10 4:44 PM, Robert Haas wrote: > To recap the previous discussion on this thread, we ended up changing > the behavior of 9.0 so that it only sends WAL which has been written > to the OS *and flushed*, because sending unflushed WAL to the standby > is unsafe. The standby can get ahead of the master while still > believing that the databases are in sync, due to the fact that after > an SR reconnect we rewind to the start of the current WAL segment. > This results in a silently corrupt standby database. What was the final decision on behavior if fsync=off? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Wed, Jul 7, 2010 at 6:44 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 7/6/10 4:44 PM, Robert Haas wrote: >> To recap the previous discussion on this thread, we ended up changing >> the behavior of 9.0 so that it only sends WAL which has been written >> to the OS *and flushed*, because sending unflushed WAL to the standby >> is unsafe. The standby can get ahead of the master while still >> believing that the databases are in sync, due to the fact that after >> an SR reconnect we rewind to the start of the current WAL segment. >> This results in a silently corrupt standby database. > > What was the final decision on behavior if fsync=off? I'm not sure we made any decision, per se, but if you use fsync=off in combination with SR and experience an unexpected crash-and-reboot on the master, you will be sad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
> Having said that, I do think we urgently need some high-level design
> discussion on how sync rep is actually going to handle this issue
> (perhaps on a new thread). If we can't resolve this issue, sync rep
> is going to be really slow; but there are no easy solutions to this
> problem in sight, so if we want to have sync rep for 9.1 we'd better
> agree on one of the difficult solutions soon so that work can begin.

When standbys reconnect after a crash, they could send the ahead-of-the-master WAL to the master. This is an alternative to choosing the most-ahead standby as the new master, as suggested elsewhere.

Greetings
Marcin Mańk
On Thu, Jul 8, 2010 at 7:55 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> What was the final decision on behavior if fsync=off? > > I'm not sure we made any decision, per se, but if you use fsync=off in > combination with SR and experience an unexpected crash-and-reboot on > the master, you will be sad. True. But, without SR, an unexpected crash-and-reboot in the master would make you sad ;) So I'm not sure whether we really need to take action for the case of SR + fsync=off. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center