Thread: Patch for fail-back without fresh backup
Hello,
Let me again summarize the problem we are trying to address.
When the master fails, the last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing that WAL to local storage. So the master contains some file-system-level changes that the standby does not have; at this point, the master's data directory is ahead of the standby's data directory.
Subsequently, the standby is promoted as the new master. Later, when the old master wants to become a standby of the new master, it cannot simply rejoin the setup, because the two servers are now inconsistent; a fresh backup has to be taken from the new master. This can happen with both synchronous and asynchronous replication.
A fresh backup is also needed for a clean switch-over because, in current HEAD, the master does not wait for the standby to receive all WAL up to the shutdown checkpoint record before closing the connection. Fujii Masao has already submitted a patch to handle the clean switch-over case, but the problem remains for the failback case.
Taking a fresh backup is very time consuming when databases are large, say several TB, and when the servers are connected over a relatively slow link. This can break the service level agreement of the disaster recovery system, so there is a need to improve the disaster recovery process in PostgreSQL. One way to achieve this is to maintain consistency between the master and the standby, which avoids the need for a fresh backup.
So our proposal is that the master must not make any file-system-level change before confirming that the corresponding WAL record has been replicated to the standby.
There have been many suggestions and objections on pgsql-hackers about this problem. A brief summary:
1. The main objection, raised by Tom and others, is that we should not add this feature and should stay with the traditional approach of taking a fresh backup with rsync, because of the additional complexity of the patch and the performance overhead during normal operation.
2. Tom and others were also worried about inconsistencies in the crashed master and suggested that it is better to start from a fresh backup. Fujii Masao and others countered that we already trust WAL recovery to clear up such inconsistencies, and there is no reason we cannot do the same here.
3. Some suggested using rsync with checksums, but many pages on the two servers may differ in their binary contents because of hint bits etc., so checksum-based rsync would still end up transferring much of the data.
4. The major objection to the failback-without-fresh-backup idea was that it may add performance overhead and code complexity. Having looked at the patch, I must say it is not too complex. For the performance impact, I tested the patch with pgbench, and it shows a very small overhead; please refer to the test results at the end of this mail.
*Proposal to solve the problem*
The proposal is based on the principle that the master should not make any file-system-level change until the corresponding WAL record has been replicated to the standby.
There are many places in the code that need to be handled to support this. The following cases explain why a fresh backup is currently needed at failover time, and how our approach avoids that need.
1. We must not write any heap pages to disk before the WAL records corresponding to those changes have been received by the standby. Otherwise, if the standby fails to receive that WAL, the heap pages on the two servers become inconsistent.
2. When a CHECKPOINT happens on the master, the master's control file is updated with the location of the latest checkpoint record. Suppose failover happens and the standby fails to receive the WAL record for that checkpoint; then the master and the standby hold inconsistent copies of the control file, the redo records do not match, and recovery will not start normally. To avoid this, we must not update the master's control file before the corresponding checkpoint WAL record has been received by the standby.
3. Similarly, when we truncate a physical file on the master and the standby fails to receive the corresponding WAL, the file is truncated on the master but still full-length on the standby, causing inconsistency. To avoid this, we must not truncate physical files on the master before the WAL record for that operation has been received by the standby.
4. The same applies to CLOG pages. If a CLOG page is written to disk and the corresponding WAL record has not been replicated to the standby, we get an inconsistency. So we must not write CLOG pages (and possibly other SLRU pages) to disk before the corresponding WAL records have been received by the standby.
5. The same problem applies to commit hint bits, but it is more complicated, because no WAL record is generated for a hint-bit update, so we cannot simply wait for a corresponding WAL record to be replicated. Instead, we delay setting the hint bits, similar to what is done for asynchronous commits: a hint bit is only set once the WAL for the corresponding transaction commit has been received by the failback-safe standby (see the sketch after this list).
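To make case 5 concrete, here is a rough, self-contained sketch of the gating check. It is only an illustration, not the actual patch code; failback_safe_recv_ptr is a hypothetical shared value that the walsender would keep up to date from standby replies.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL positions are 64-bit integers in 9.3 */

/* Hypothetical: latest WAL position confirmed received by the failback-safe standby. */
static XLogRecPtr failback_safe_recv_ptr = 0;

/*
 * Analogous to the asynchronous-commit check: if the commit WAL record of the
 * transaction we want to hint has not yet been confirmed by the failback-safe
 * standby, do not set the hint bit now; a later visit to the tuple will set
 * it once the standby has caught up.
 */
static bool
hint_bit_allowed(XLogRecPtr commit_lsn)
{
    return commit_lsn <= failback_safe_recv_ptr;
}
```

In the actual patch such a check would most naturally sit next to the existing asynchronous-commit test in SetHintBits(), which already skips the hint-bit update when the commit record has not yet been flushed to local WAL.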
*Patch explanation:*
The initial work on this patch was done by Pavan Deolasee. I have tested it and will make further enhancements based on community feedback.
The patch is not complete yet, but I plan to complete it with the help of this community. At this point, the primary purpose is to understand the complexities and to get some initial performance numbers to address the concerns raised by the community.
Two new GUC parameters control the failsafe standby:
1. failback_safe_standby_name [ name of the failsafe standby ]: the name of the failback-safe standby. The master will not make any file-system-level change before the corresponding WAL has been replicated to this standby.
2. failback_safe_standby_mode [ off / remote_write / remote_flush ]: specifies whether the master should wait for the WAL to be written on the standby or flushed on the standby. Set it to off when no failsafe standby is wanted. This failsafe mode can be combined with either synchronous or asynchronous streaming replication.
Most of the changes are in syncrep.c. That is a slight misnomer, because that file deals with synchronous standbys, and a failback-safe standby could, and most likely will, be an asynchronous standby. But keeping the changes there makes the patch easy to read; once the approach is accepted, the code can be reorganized in a more logical way.
The patch adds a new state, SYNC_REP_WAITING_FOR_FAILBACK_SAFETY, to the sync standby states. A backend waiting for the failback-safe standby to receive WAL records waits in this state. The failback-safe mechanism can work in two modes, waiting for the WAL to be written or flushed on the failsafe standby, represented by the new modes SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH respectively.
SyncRepWaitForLSN() is also changed to support a conditional (non-blocking) wait, so that hint bit updates on the master can simply be delayed instead of blocking until the failback-safe standby receives the WAL, as sketched below.
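The conditional wait can be pictured with this self-contained sketch. Again, it is only an illustration of the behaviour described above, not the patch code; failback_safe_lsn[] and wait_for_standby_progress() are hypothetical stand-ins for shared state advanced by the walsender and for waiting on the backend's latch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical wait modes, mirroring SYNC_REP_WAIT_FAILBACK_SAFE_WRITE/FLUSH. */
enum { FAILBACK_SAFE_WRITE = 0, FAILBACK_SAFE_FLUSH = 1, NUM_FAILBACK_MODES = 2 };

/* Hypothetical shared state, advanced by the walsender from standby replies. */
static volatile XLogRecPtr failback_safe_lsn[NUM_FAILBACK_MODES];

/* Stand-in for waiting on the backend's latch until the walsender signals progress. */
static void
wait_for_standby_progress(void)
{
    usleep(1000);
}

/*
 * Return true once the failback-safe standby has confirmed WAL up to 'lsn' in
 * the requested mode.  With wait = false, never block: report the current
 * state so the caller can simply try again later.
 */
static bool
FailbackSafeWaitForLSN(XLogRecPtr lsn, int mode, bool wait)
{
    while (failback_safe_lsn[mode] < lsn)
    {
        if (!wait)
            return false;
        wait_for_standby_progress();
    }
    return true;
}
```

Callers that must not proceed without the standby (heap-page writes, control-file updates, truncation, SLRU writes) would pass wait = true and block; the hint-bit path would pass wait = false and retry on a later visit to the tuple.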
*Benchmark tests*
*PostgreSQL versions:* PostgreSQL 9.3beta1
*Usage:* To operate in failsafe mode you need to configure the following two GUC parameters:
1. failback_safe_standby_name
2. failback_safe_standby_mode
*Performance impact:*
The tests were performed on servers with 32 GB RAM, with checkpoint_timeout set to 10 minutes so that checkpoints occur during each test run. A checkpoint flushes all dirty buffers to disk, and we primarily wanted to exercise that code path.
pgbench settings:
Transaction type: TPC-B
Scaling factor: 100
Query mode: simple
Number of clients: 100
Number of threads: 1
Duration: 1800 s
The following table shows the average TPS for each scenario (we ran 3 tests per scenario):
Scenario                                                         Average TPS
1) Synchronous Replication                                           947
2) Synchronous Replication + Failsafe standby (off)                  934
3) Synchronous Replication + Failsafe standby (remote_flush)         931
4) Asynchronous Replication                                         1369
5) Asynchronous Replication + Failsafe standby (off)                1349
6) Asynchronous Replication + Failsafe standby (remote_flush)       1350
From the table we can conclude the following:
1. Streaming replication + failback-safe standby (remote_flush):
a) Synchronous replication combined with the failsafe standby in remote_flush mode shows about 1.68% performance overhead.
b) Asynchronous streaming replication combined with the failsafe standby in remote_flush mode shows about 1.38% performance degradation.
2. Streaming replication + failback-safe standby (turned off):
a) Synchronous replication with the failsafe standby turned off shows about 1.37% performance overhead.
b) Asynchronous streaming replication with the failsafe standby turned off shows about 1.46% performance degradation.
So the patch shows a 1-2% performance overhead.
Please let me know if there are other scenarios that should be tested.
*Improvements (To-do):*
1. Currently this patch supports only one failback-safe standby, which can be either a synchronous or an asynchronous standby. We probably need to discuss whether to support multiple failback-safe standbys.
2. In the current design the master will wait forever for the failback-safe standby. Streaming replication has the same limitation. We probably need to discuss whether this needs to change.
There are a couple more places that probably need attention; I have marked them with XXX.
Thank you,
Samrat
On 14.06.2013 12:11, Samrat Revagade wrote: > We have already started a discussion on pgsql-hackers for the problem of > taking fresh backup during the failback operation here is the link for that: > > http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com > > Let me again summarize the problem we are trying to address. > > When the master fails, last few WAL files may not reach the standby. But > the master may have gone ahead and made changes to its local file system > after flushing WAL to the local storage. So master contains some file > system level changes that standby does not have. At this point, the data > directory of master is ahead of standby's data directory. > > Subsequently, the standby will be promoted as new master. Later when the > old master wants to be a standby of the new master, it can't just join the > setup since there is inconsistency in between these two servers. We need to > take the fresh backup from the new master. This can happen in both the > synchronous as well as asynchronous replication. Did you see the thread on the little tool I wrote called pg_rewind? http://www.postgresql.org/message-id/519DF910.4020609@vmware.com It solves that problem, for both clean and unexpected shutdown. It needs some more work and a lot more testing, but requires no changes to the backend. Robert Haas pointed out in that thread that it has a problem with hint bits that are not WAL-logged, but it will still work if you also enable the new checksums feature, which forces hint bit updates to be WAL-logged. Perhaps we could add a GUC to enable hint bits to be WAL-logged, regardless of checksums, to make pg_rewind work. I think that's a more flexible approach to solve this problem. It doesn't require an online feedback loop from the standby to master, for starters. - Heikki
An alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL): provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen.
On 14.06.2013 14:06, Pavan Deolasee wrote: > On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas<hlinnakangas@vmware.com >> wrote: > >> Robert Haas pointed out in that thread that it has a problem with hint >> bits that are not WAL-logged, > > I liked that tool a lot until Robert pointed out the above problem. I > thought this is a show stopper because I can't really see any way to > circumvent it unless we enable checksums or explicitly WAL log hint bits. > >> but it will still work if you also enable the new checksums feature, which >> forces hint bit updates to be WAL-logged. > > Are we expecting a lot of people to run their clusters with checksums on ? > Sorry, I haven't followed the checksum discussions and don't know how much > overhead it causes. But if the general expectation is that checksums will > be turned on most often, I agree pg_rewind is probably good enough. Well, time will tell I guess. The biggest overhead with the checksums is exactly the WAL-logging of hint bits. >> Perhaps we could add a GUC to enable hint bits to be WAL-logged, >> regardless of checksums, to make pg_rewind work. > > Wouldn't that be too costly ? I mean, in the worst case every hint bit on a > page may get updated separately. If each such update is WAL logged, we are > looking for a lot more unnecessary WAL traffic. Yep, same as with checksums. I was not very enthusiastic about the checksums patch because of that, but a lot of people are willing to pay that price. Maybe we can figure out a way to reduce that cost in 9.4. It'd benefit the checksums greatly. For pg_rewind, we wouldn't actually need a full-page image for hint bit updates, just a small record saying "hey, I touched this page". And you'd only need to write that the first time a page is touched after a checkpoint. >> I think that's a more flexible approach to solve this problem. It doesn't >> require an online feedback loop from the standby to master, for starters. > > I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work > with 9.3 and below because of the hint bits issue, otherwise it would have > been even more cool. The proposed patch is clearly not 9.3 material either. If anything, there's a much better change that we could still sneak in a GUC to allow hint bits to be WAL-logged without checksums in 9.3. All the code is there, it'd just be a new guc to control it separetely from checksums. - Heikki
On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > For pg_rewind, we wouldn't actually need a full-page image for hint bit > updates, just a small record saying "hey, I touched this page". And you'd > only need to write that the first time a page is touched after a checkpoint. I would expect that to be about the same cost though. The latency for the fsync on the wal record before being able to flush the buffer is the biggest cost. > The proposed patch is clearly not 9.3 material either. If anything, there's > a much better change that we could still sneak in a GUC to allow hint bits > to be WAL-logged without checksums in 9.3. All the code is there, it'd just > be a new guc to control it separetely from checksums. On the other hand if you're going to wal log the hint bits why not enable checksums? Do we allow turning off checksums after a database is initdb'd? IIRC we can't turn it on later but I don't see why we couldn't turn them off. -- greg
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > Well, time will tell I guess. The biggest overhead with the checksums is > exactly the WAL-logging of hint bits. Refresh my memory as to why we need to WAL-log hints for checksumming? I just had my nose in the part of the checksum patch that tediously copies entire pages out of shared buffers to avoid possible instability of the hint bits while we checksum and write the page. Given that we're paying that cost, I don't see why we'd need to do any extra WAL-logging (above and beyond the log-when-freeze cost that we have to pay already). But I've not absorbed any caffeine yet today, so maybe I'm just missing it. regards, tom lane
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: > Hello, > We have already started a discussion on pgsql-hackers for the problem of taking fresh backup during the failback operation here is the link for that: > http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb JgWrFu513s+Q@mail.gmail.com > Let me again summarize the problem we are trying to address. > When the master fails, last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after > flushing WAL to the local storage. So master contains some file system level changes that standby does not have. At this point, the data directory of > master is ahead of standby's data directory. > Subsequently, the standby will be promoted as new master. Later when the old master wants to be a standby of the new master, it can't just join the > setup since there is inconsistency in between these two servers. We need to take the fresh backup from the new master. This can happen in both the > synchronous as well as asynchronous replication. > Fresh backup is also needed in case of clean switch-over because in the current HEAD, the master does not wait for the standby to receive all the WAL > up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle clean switch-over case, > but the problem is still remaining for failback case. > The process of taking fresh backup is very time consuming when databases are of very big sizes, say several TB's, and when the servers are connected > over a relatively slower link. This would break the service level agreement of disaster recovery system. So there is need to improve the process of > disaster recovery in PostgreSQL. One way to achieve this is to maintain consistency between master and standby which helps to avoid need of fresh > backup. > So our proposal on this problem is that we must ensure that master should not make any file system level changes without confirming that the > corresponding WAL record is replicated to the standby. How will you take care of extra WAL on old master during recovery. If it plays the WAL which has not reached new-master, it can be a problem. With Regards, Amit Kapila.
On 2013-06-14 09:08:15 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: > > Well, time will tell I guess. The biggest overhead with the checksums is > > exactly the WAL-logging of hint bits. > > Refresh my memory as to why we need to WAL-log hints for checksumming? > I just had my nose in the part of the checksum patch that tediously > copies entire pages out of shared buffers to avoid possible instability > of the hint bits while we checksum and write the page. I am really rather uncomfortable with that piece of code, and I hacked it up after Jeff Janes had reported a bug there (The one aborting WAL replay to early...). So I am very happy that you are looking at it. Jeff Davis and I were talking about whether the usage of PGXAC->delayChkpt makes the whole thing sufficiently safe at pgcon - we couldn't find any real danger but... > Given that we're > paying that cost, I don't see why we'd need to do any extra WAL-logging > (above and beyond the log-when-freeze cost that we have to pay already). > But I've not absorbed any caffeine yet today, so maybe I'm just missing > it. The usual torn page spiel I think. If we crash while only one half of the page made it to disk we would get spurious checksum failures from there on. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14.06.2013 16:08, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> Well, time will tell I guess. The biggest overhead with the checksums is >> exactly the WAL-logging of hint bits. > > Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages: 1. Backend sets a hint bit, dirtying the buffer. 2. Checksum is calculated, and buffer is written out to disk. 3. <crash> If the page is torn, the checksum won't match. Without checksums, a torn page is not a problem with hint bits, as a single bit can't be torn and the page is otherwise intact. But with checksums, it causes a checksum failure. - Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 14.06.2013 16:08, Tom Lane wrote: >> Refresh my memory as to why we need to WAL-log hints for checksumming? > Torn pages: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? regards, tom lane
On 14.06.2013 16:21, Tom Lane wrote: > Heikki Linnakangas<hlinnakangas@vmware.com> writes: >> On 14.06.2013 16:08, Tom Lane wrote: >>> Refresh my memory as to why we need to WAL-log hints for checksumming? > >> Torn pages: > > So it's not that we actually need to log the individual hint bit > changes, it's that we need to WAL-log a full page image on the first > update after a checkpoint, so as to recover from torn-page cases. > Which one are we doing? Correct. We're doing the latter, see XLogSaveBufferForHint(). - Heikki
On 2013-06-14 09:21:52 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: > > On 14.06.2013 16:08, Tom Lane wrote: > >> Refresh my memory as to why we need to WAL-log hints for checksumming? > > > Torn pages: > > So it's not that we actually need to log the individual hint bit > changes, it's that we need to WAL-log a full page image on the first > update after a checkpoint, so as to recover from torn-page cases. > Which one are we doing? MarkBufferDirtyHint() loggs an FPI (just not via a BKP block) via XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by comparing GetRedoRecPtr() with the page's lsn. Otherwise we don't do anything besides marking the buffer dirty. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
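To summarize the check being discussed in the last few messages: a hint-bit-only change is preceded by a full-page image only the first time the page is dirtied after a checkpoint. The following is a sketch only, not the actual code of MarkBufferDirtyHint()/XLogCheckBuffer(); page_lsn and redo_ptr are illustrative names for the values of PageGetLSN() and GetRedoRecPtr().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * A hint-bit-only change needs a preceding full-page image iff the page has
 * not been WAL-logged since the current checkpoint's redo point.
 */
static bool
needs_full_page_image_for_hint(XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    /*
     * If the page was last WAL-logged before the redo point, a torn write of
     * this change could leave a page whose checksum no longer matches, so an
     * FPI must be logged first.  Otherwise some record after the checkpoint
     * has already logged a full image of this page, and crash recovery will
     * restore it from there.
     */
    return page_lsn <= redo_ptr;
}
```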
On 14.06.2013 16:15, Andres Freund wrote: > On 2013-06-14 09:08:15 -0400, Tom Lane wrote: >> I just had my nose in the part of the checksum patch that tediously >> copies entire pages out of shared buffers to avoid possible instability >> of the hint bits while we checksum and write the page. > > I am really rather uncomfortable with that piece of code, and I hacked > it up after Jeff Janes had reported a bug there (The one aborting WAL > replay to early...). So I am very happy that you are looking at it. Hmm. In XLogSaveBufferForHint(): > * Note that this only works for buffers that fit the standard page model, > * i.e. those for which buffer_std == true The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't that completely broken for the FSM? If I'm reading it correctly, what will happen is that replay will completely zero out all FSM pages that have been touched. All the FSM data is between pd_lower and pd_upper, which on standard pages is the "hole". - Heikki
On 2013-06-14 09:21:52 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnakangas@vmware.com> writes: > > On 14.06.2013 16:08, Tom Lane wrote: > >> Refresh my memory as to why we need to WAL-log hints for checksumming? > > > Torn pages: > > So it's not that we actually need to log the individual hint bit > changes, it's that we need to WAL-log a full page image on the first > update after a checkpoint, so as to recover from torn-page cases. > Which one are we doing? From quickly looking at the code again I think the MarkBufferDirtyHint() code makes at least one assumption that isn't correct in the fact of checksums. It tests for the need to dirty the page with:if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED)) *before* taking a lock. A comment explains why that is safe: * Since we make this test unlocked, there's a chance we * might fail to notice that the flags have just been cleared,and failed * to reset them, due to memory-ordering issues. That's fine for the classical usecase without checksums but what about the following scenario: 1) page is dirtied, FPI is logged 2) SetHintBits gets called on the same page, holding only a share lock 3) checkpointer/bgwriter/... writes out the the page, clearing the dirty flag 4) checkpoint finishes, updates redo ptr 5) SetHintBits actually modifies the hint bits 6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the page isn't dirty anymore and thus doesn't check whethersomething needs to get logged. At this point we have a page that has been modified without an FPI. But it's not marked dirty, so it won't be written out without further cause. Which might be fine since there's no cause to write out the page and there probably won't be anyone doing that without logging an FPI independently. Can anybody see a scenario where this is actually dangerous? Since Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14.06.2013 17:01, Andres Freund wrote: > At this point we have a page that has been modified without an FPI. But > it's not marked dirty, so it won't be written out without further > cause. Which might be fine since there's no cause to write out the page > and there probably won't be anyone doing that without logging an FPI > independently. > Can anybody see a scenario where this is actually dangerous? The code also relies on that being safe during recovery: > * If we're in recovery we cannot dirty a page because of a hint. > * We can set the hint, just not dirty the page as a result so the > * hint is lost when we evict the page or shutdown. > * > * See src/backend/storage/page/README for longer discussion. > */ > if (RecoveryInProgress()) > return; I can't immediately see a problem with that. - Heikki
On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote: > On 14.06.2013 16:15, Andres Freund wrote: > >On 2013-06-14 09:08:15 -0400, Tom Lane wrote: > >>I just had my nose in the part of the checksum patch that tediously > >>copies entire pages out of shared buffers to avoid possible instability > >>of the hint bits while we checksum and write the page. > > > >I am really rather uncomfortable with that piece of code, and I hacked > >it up after Jeff Janes had reported a bug there (The one aborting WAL > >replay to early...). So I am very happy that you are looking at it. > > Hmm. In XLogSaveBufferForHint(): > > > * Note that this only works for buffers that fit the standard page model, > > * i.e. those for which buffer_std == true > > The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't > that completely broken for the FSM? If I'm reading it correctly, what will > happen is that replay will completely zero out all FSM pages that have been > touched. All the FSM data is between pd_lower and pd_upper, which on > standard pages is the "hole". Jeff Davis has a patch pending (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std flag down to MarkBufferDirtyHint() for exactly that reason. I thought we were on track committing that, but rereading the thread it doesn't look that way. Jeff, care to update that patch? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So it's not that we actually need to log the individual hint bit > changes, it's that we need to WAL-log a full page image on the first > update after a checkpoint, so as to recover from torn-page cases. > Which one are we doing? Wal logging a full page image after a checkpoint wouldn't actually be enough since subsequent hint bits will dirty the page and not wal log anything creating a new torn page risk. FPI are only useful if all the subsequent updates are wal logged. -- greg
Greg Stark <stark@mit.edu> writes: > On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So it's not that we actually need to log the individual hint bit >> changes, it's that we need to WAL-log a full page image on the first >> update after a checkpoint, so as to recover from torn-page cases. >> Which one are we doing? > Wal logging a full page image after a checkpoint wouldn't actually be > enough since subsequent hint bits will dirty the page and not wal log > anything creating a new torn page risk. FPI are only useful if all the > subsequent updates are wal logged. No, there's no new torn page risk, because any crash recovery would replay starting from the checkpoint. You might lose the subsequently-set hint bits, but that's okay. regards, tom lane
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote: > Jeff Davis has a patch pending > (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std > flag down to MarkBufferDirtyHint() for exactly that reason. I thought we > were on track committing that, but rereading the thread it doesn't look > that way. > > Jeff, care to update that patch? Rebased and attached. Changed so all callers use buffer_std=true except those in freespace.c and fsmpage.c. Simon, did you (or anyone else) have an objection to this patch? If not, I'll go ahead and commit it tomorrow morning. Regards, Jeff Davis
On 2013-06-14 09:21:12 -0700, Jeff Davis wrote: > On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote: > > Jeff Davis has a patch pending > > (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std > > flag down to MarkBufferDirtyHint() for exactly that reason. I thought we > > were on track committing that, but rereading the thread it doesn't look > > that way. > > > > Jeff, care to update that patch? > > Rebased and attached. Changed so all callers use buffer_std=true except > those in freespace.c and fsmpage.c. > > Simon, did you (or anyone else) have an objection to this patch? If not, > I'll go ahead and commit it tomorrow morning. I'd like to see a comment around the memcpys in XLogSaveBufferForHint() that mentions that they are safe in a non std buffer due to XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit change of the copy algorithm there. Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI or something like that. I find the former name confusing... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: >> Hello, > >> We have already started a discussion on pgsql-hackers for the problem of > taking fresh backup during the failback operation here is the link for that: > >> > http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb > JgWrFu513s+Q@mail.gmail.com > >> Let me again summarize the problem we are trying to address. > >> When the master fails, last few WAL files may not reach the standby. But > the master may have gone ahead and made changes to its local file system > after > flushing WAL to the local storage. So master contains some file > system level changes that standby does not have. At this point, the data > directory of > master is ahead of standby's data directory. >> Subsequently, the standby will be promoted as new master. Later when the > old master wants to be a standby of the new master, it can't just join the >> setup since there is inconsistency in between these two servers. We need > to take the fresh backup from the new master. This can happen in both the >> synchronous as well as asynchronous replication. > >> Fresh backup is also needed in case of clean switch-over because in the > current HEAD, the master does not wait for the standby to receive all the > WAL >> up to the shutdown checkpoint record before shutting down the connection. > Fujii Masao has already submitted a patch to handle clean switch-over case, >> but the problem is still remaining for failback case. > >> The process of taking fresh backup is very time consuming when databases > are of very big sizes, say several TB's, and when the servers are connected >> over a relatively slower link. This would break the service level > agreement of disaster recovery system. So there is need to improve the > process of >> disaster recovery in PostgreSQL. One way to achieve this is to maintain > consistency between master and standby which helps to avoid need of fresh >> backup. > >> So our proposal on this problem is that we must ensure that master should > not make any file system level changes without confirming that the >> corresponding WAL record is replicated to the standby. > > How will you take care of extra WAL on old master during recovery. If it > plays the WAL which has not reached new-master, it can be a problem. you means that there is possible that old master's data ahead of new master's data. so there is inconsistent data between those server when fail back. right? if so , there is not possible inconsistent. because if you use GUC option as his propose (i.g., failback_safe_standby_mode = remote_flush), when old master is working fine, all file system level changes aren't done before WAL replicated. -- Regards, ------- Sawada Masahiko
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: >> Hello, > >>> We have already started a discussion on pgsql-hackers for the problem of >>> taking fresh backup during the failback operation here is the link for that: >>> >>> >>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb >>> JgWrFu513s+Q@mail.gmail.com >>> >>> Let me again summarize the problem we are trying to address. >> >> >> How will you take care of extra WAL on old master during recovery. If it >> plays the WAL which has not reached new-master, it can be a problem. > you means that there is possible that old master's data ahead of new > master's data. I mean to say is that WAL of old master can be ahead of new master. I understood that data files of old master can't beahead, but I think WAL can be ahead. > so there is inconsistent data between those server when fail back. right? > if so , there is not possible inconsistent. because if you use GUC option > as his propose (i.g., failback_safe_standby_mode = remote_flush), > when old master is working fine, all file system level changes aren't > done before WAL replicated. Would the propose patch will take care that old master's WAL is also not ahead in some way? If yes, I think i am missing some point. With Regards, Amit Kapila.
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote: > > On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: > On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote: >> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: >>> Hello, >> >>>> We have already started a discussion on pgsql-hackers for the problem of >>>> taking fresh backup during the failback operation here is the link for that: >>>> >>>> >>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb >>>> JgWrFu513s+Q@mail.gmail.com >>>> >>>> Let me again summarize the problem we are trying to address. >>> >>> >>> How will you take care of extra WAL on old master during recovery. If it >>> plays the WAL which has not reached new-master, it can be a problem. > >> you means that there is possible that old master's data ahead of new >> master's data. > > I mean to say is that WAL of old master can be ahead of new master. I understood that > data files of old master can't be ahead, but I think WAL can be ahead. > >> so there is inconsistent data between those server when fail back. right? >> if so , there is not possible inconsistent. because if you use GUC option >> as his propose (i.g., failback_safe_standby_mode = remote_flush), >> when old master is working fine, all file system level changes aren't >> done before WAL replicated. > > Would the propose patch will take care that old master's WAL is also not ahead in some way? > If yes, I think i am missing some point. yes it will happen that old master's WAL ahead of new master's WAL as you said. but I think that we can solve them by delete all WAL file when old master starts as new standby.thought? Regards, ------- Sawada Masahiko
On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote: > I'd like to see a comment around the memcpys in XLogSaveBufferForHint() > that mentions that they are safe in a non std buffer due to > XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit > change of the copy algorithm there. Done. > Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI > or something like that. I find the former name confusing... Also done. Patch attached. Also, since we branched, I think this should be back-patched to 9.3 as well. Regards, Jeff Davis
On 2013-06-15 11:36:54 -0700, Jeff Davis wrote: > On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote: > > I'd like to see a comment around the memcpys in XLogSaveBufferForHint() > > that mentions that they are safe in a non std buffer due to > > XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit > > change of the copy algorithm there. > > Done. > Also done. Thanks! Looks good to me. > Patch attached. Also, since we branched, I think this should be > back-patched to 9.3 as well. Absolutely. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote: On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote: > > On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: > On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote: >> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: >>> Hello, >> >>>>> We have already started a discussion on pgsql-hackers for the problem of >>>>> taking fresh backup during the failback operation here is the link for that: >>>>> >>>>> >>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb >>>>> JgWrFu513s+Q@mail.gmail.com >>>>> >>>>> Let me again summarize the problem we are trying to address. >>>>> >>>>> >>>> How will you take care of extra WAL on old master during recovery. If it >>>> plays the WAL which has not reached new-master, it can be a problem. > >>> you means that there is possible that old master's data ahead of new >>> master's data. > >> I mean to say is that WAL of old master can be ahead of new master. I understood that >> data files of old master can't be ahead, but I think WAL can be ahead. > >>> so there is inconsistent data between those server when fail back. right? >>> if so , there is not possible inconsistent. because if you use GUC option >>> as his propose (i.g., failback_safe_standby_mode = remote_flush), >>> when old master is working fine, all file system level changes aren't >>> done before WAL replicated. > >> Would the propose patch will take care that old master's WAL is also not ahead in some way? >> If yes, I think i am missing some point. > yes it will happen that old master's WAL ahead of new master's WAL as you said. > but I think that we can solve them by delete all WAL file when old > master starts as new standby. I think ideally, it should reset WAL location at the point where new master has forrked off. In such a scenario it would be difficult for user who wants to get a dump of some data in old master which hasn't gone to new master. I am not sure if such a need is there for real users, but if it is there, then providing this solution will have some drawbacks. With Regards, Amit Kapila.
On 14 June 2013 17:21, Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote: >> Jeff Davis has a patch pending >> (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std >> flag down to MarkBufferDirtyHint() for exactly that reason. I thought we >> were on track committing that, but rereading the thread it doesn't look >> that way. >> >> Jeff, care to update that patch? > > Rebased and attached. Changed so all callers use buffer_std=true except > those in freespace.c and fsmpage.c. > > Simon, did you (or anyone else) have an objection to this patch? If not, > I'll go ahead and commit it tomorrow morning. I didn't have a specific objection to the patch, I just wanted to minimise change relating to this so we didn't introduce further bugs. I've no objection to you committing that. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14 June 2013 10:11, Samrat Revagade <revagade.samrat@gmail.com> wrote: > We have already started a discussion on pgsql-hackers for the problem of > taking fresh backup during the failback operation here is the link for that: > > http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com > So our proposal on this problem is that we must ensure that master should > not make any file system level changes without confirming that the > corresponding WAL record is replicated to the standby. > 1. The main objection was raised by Tom and others is that we should not add > this feature and should go with traditional way of taking fresh backup using > the rsync, because he was concerned about the additional complexity of the > patch and the performance overhead during normal operations. > > 2. Tom and others were also worried about the inconsistencies in the crashed > master and suggested that its better to start with a fresh backup. Fujii > Masao and others correctly countered that suggesting that we trust WAL > recovery to clear all such inconsistencies and there is no reason why we > can't do the same here. > So the patch is showing 1-2% performance overhead. Let's have a look at this... The objections you summarise that Tom has made are ones that I agree with. I also don't think that Fujii "correctly countered" those objections. My perspective is that if the master crashed, assuming that you know everything about that and suddenly jumping back on seem like a recipe for disaster. Attempting that is currently blocked by the technical obstacles you've identified, but that doesn't mean they are the only ones - we don't yet understand what all the problems lurking might be. Personally, I won't be following you onto that minefield anytime soon. So I strongly object to calling this patch anything to do with "failback safe". You simply don't have enough data to make such a bold claim. (Which is why we call it synchronous replication and not "zero data loss", for example). But that's not the whole story. I can see some utility in a patch that makes all WAL transfer synchronous, rather than just commits. Some name like synchronous_transfer might be appropriate. e.g. synchronous_transfer = all | commit (default). The idea of another slew of parameters that are very similar to synchronous replication but yet somehow different seems weird. I can't see a reason why we'd want a second lot of parameters. Why not just use the existing ones for sync rep? (I'm surprised the Parameter Police haven't visited you in the night...) Sure, we might want to expand the design for how we specify multi-node sync rep, but that is a different patch. I'm worried to see that adding this feature and yet turning it off causes a measureable drop in performance. I don't think we want that at all. That clearly needs more work and thought. I also think your performance results are somewhat bogus. Fast transaction workloads were already mostly commit waits - measurements of what happens to large loads, index builds etc would likely reveal something quite different. I'm tempted by the thought that we should put the WaitForLSN inside XLogFlush, rather than scatter additional calls everywhere and then have us inevitably miss one. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 16 June 2013 17:25, Samrat Revagade <revagade.samrat@gmail.com> wrote: > > > On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> >> >> So I strongly object to calling this patch anything to do with >> "failback safe". You simply don't have enough data to make such a bold >> claim. (Which is why we call it synchronous replication and not "zero >> data loss", for example). >> >> But that's not the whole story. I can see some utility in a patch that >> makes all WAL transfer synchronous, rather than just commits. Some >> name like synchronous_transfer might be appropriate. e.g. >> synchronous_transfer = all | commit (default). >> > > I agree with you about the fact that, > Now a days the need of fresh backup in crash recovery seems to be a major > problem. > we might need to change the name of patch if there other problems too with > crash recovery. (Sorry don't understand) >> The idea of another slew of parameters that are very similar to >> synchronous replication but yet somehow different seems weird. I can't >> see a reason why we'd want a second lot of parameters. Why not just >> use the existing ones for sync rep? (I'm surprised the Parameter >> Police haven't visited you in the night...) Sure, we might want to >> expand the design for how we specify multi-node sync rep, but that is >> a different patch. >> > > The different set of parameters are needed to differentiate between > fail-safe standby and synchronous standby, the fail-safe standby and standby > in synchronous replication can be two different servers. Why would they be different? What possible reason would you have for that config? There is no *need* for those parameters, the proposal could work perfectly well without them. Let's make this patch fulfill the stated objectives, not add in optional extras, especially ones that don't appear well thought through. If you wish to enhance the design for the specification of multi-node sync rep, make that a separate patch, later. --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > I agree. We should probably find a better name for this. Any suggestions ? err, I already made one... >> But that's not the whole story. I can see some utility in a patch that >> makes all WAL transfer synchronous, rather than just commits. Some >> name like synchronous_transfer might be appropriate. e.g. >> synchronous_transfer = all | commit (default). > Since commits are more foreground in nature and this feature > does not require us to wait during common foreground activities, we want a > configuration where master can wait for synchronous transfers at other than > commits. May we can solve that by having more granular control to the said > parameter ? > >> >> The idea of another slew of parameters that are very similar to >> synchronous replication but yet somehow different seems weird. I can't >> see a reason why we'd want a second lot of parameters. Why not just >> use the existing ones for sync rep? (I'm surprised the Parameter >> Police haven't visited you in the night...) Sure, we might want to >> expand the design for how we specify multi-node sync rep, but that is >> a different patch. > > > How would we then distinguish between synchronous and the new kind of > standby ? That's not the point. The point is "Why would we have a new kind of standby?" and therefore why do we need new parameters? > I am told, one of the very popular setups for DR is to have one > local sync standby and one async (may be cascaded by the local sync). Since > this new feature is more useful for DR because taking a fresh backup on a > slower link is even more challenging, IMHO we should support such setups. ...which still doesn't make sense to me. Lets look at that in detail. Take 3 servers, A, B, C with A and B being linked by sync rep, and C being safety standby at a distance. Either A or B is master, except in disaster. So if A is master, then B would be the failover target. If A fails, then you want to failover to B. Once B is the target, you want to failback to A as the master. C needs to follow the new master, whichever it is. If you set up sync rep between A and B and this new mode between A and C. When B becomes the master, you need to failback from B from A, but you can't because the new mode applied between A and C only, so you have to failback from C to A. So having the new mode not match with sync rep means you are forcing people to failback using the slow link in the common case. You might observe that having the two modes match causes problems if A and B fail, so you are forced to go to C as master and then eventually failback to A or B across a slow link. That case is less common and could be solved by extending sync transfer to more/multi nodes. It definitely doesn't make sense to have sync rep on anything other than a subset of sync transfer. So while it may be sensible in the future to make sync transfer a superset of sync rep nodes, it makes sense to make them the same config for now. Phew --Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote: > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote: > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote: >> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote: >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: >>>> Hello, >>> >>>>>> We have already started a discussion on pgsql-hackers for the problem of >>>>>> taking fresh backup during the failback operation here is the link for that: >>>>>> >>>>>> >>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb >>>>>> JgWrFu513s+Q@mail.gmail.com >>>>>> >>>>>> Let me again summarize the problem we are trying to address. >>>>>> >>>>>> >>>>> How will you take care of extra WAL on old master during recovery. If it >>>>> plays the WAL which has not reached new-master, it can be a problem. >> >>>> you means that there is possible that old master's data ahead of new >>>> master's data. >> >>> I mean to say is that WAL of old master can be ahead of new master. I understood that >>> data files of old master can't be ahead, but I think WAL can be ahead. >> >>>> so there is inconsistent data between those server when fail back. right? >>>> if so , there is not possible inconsistent. because if you use GUC option >>>> as his propose (i.g., failback_safe_standby_mode = remote_flush), >>>> when old master is working fine, all file system level changes aren't >>>> done before WAL replicated. >> >>> Would the propose patch will take care that old master's WAL is also not ahead in some way? >>> If yes, I think i am missing some point. > >> yes it will happen that old master's WAL ahead of new master's WAL as you said. >> but I think that we can solve them by delete all WAL file when old >> master starts as new standby. > > I think ideally, it should reset WAL location at the point where new master has forrked off. > In such a scenario it would be difficult for user who wants to get a dump of some data in > old master which hasn't gone to new master. I am not sure if such a need is there for real users, but if it > is there, then providing this solution will have some drawbacks. I think that we can dumping data before all WAL files deleting. All WAL files deleting is done when old master starts as new standby. Regards, ------- Sawada Masahiko
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote: > On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> > wrote: > > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote: > > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila > <amit.kapila@huawei.com> wrote: > >> > >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote: > >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila > <amit.kapila@huawei.com> wrote: > >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: > >>>> Hello, > >>> > >>>>>> We have already started a discussion on pgsql-hackers for the > problem of > >>>>>> taking fresh backup during the failback operation here is the > link for that: > >>>>>> > >>>>>> > >>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe- > 6OzRaew5pWhk7yQtb > >>>>>> JgWrFu513s+Q@mail.gmail.com > >>>>>> > >>>>>> Let me again summarize the problem we are trying to address. > >>>>>> > >>>>>> > >>>>> How will you take care of extra WAL on old master during > recovery. If it > >>>>> plays the WAL which has not reached new-master, it can be a > problem. > >> > >>>> you means that there is possible that old master's data ahead of > new > >>>> master's data. > >> > >>> I mean to say is that WAL of old master can be ahead of new > master. I understood that > >>> data files of old master can't be ahead, but I think WAL can be > ahead. > >> > >>>> so there is inconsistent data between those server when fail back. > right? > >>>> if so , there is not possible inconsistent. because if you use GUC > option > >>>> as his propose (i.g., failback_safe_standby_mode = remote_flush), > >>>> when old master is working fine, all file system level changes > aren't > >>>> done before WAL replicated. > >> > >>> Would the propose patch will take care that old master's WAL is > also not ahead in some way? > >>> If yes, I think i am missing some point. > > > >> yes it will happen that old master's WAL ahead of new master's WAL > as you said. > >> but I think that we can solve them by delete all WAL file when old > >> master starts as new standby. > > > > I think ideally, it should reset WAL location at the point where new > master has forrked off. > > In such a scenario it would be difficult for user who wants to get a > dump of some data in > > old master which hasn't gone to new master. I am not sure if such a > need is there for real users, but if it > > is there, then providing this solution will have some drawbacks. > I think that we can dumping data before all WAL files deleting. All > WAL files deleting is done when old master starts as new standby. Can we dump data without starting server? With Regards, Amit Kapila.
On Wednesday, June 19, 2013 10:45 PM Sawada Masahiko wrote:
On Tuesday, June 18, 2013, Amit Kapila wrote:
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> >>>> Hello,
>>> I think that we can dumping data before all WAL files deleting. All
>>> WAL files deleting is done when old master starts as new standby.
>> Can we dump data without starting server?
> Sorry I made a mistake. We can't it.
> this proposing patch need to be able to also handle such scenario in future.

I am not sure the proposed patch can handle it so easily, but I think if others also feel it is important, then a method should be provided to the user for extracting his last committed data.

With Regards,
Amit Kapila.
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > >> I agree. We should probably find a better name for this. Any suggestions ? > > err, I already made one... > >>> But that's not the whole story. I can see some utility in a patch that >>> makes all WAL transfer synchronous, rather than just commits. Some >>> name like synchronous_transfer might be appropriate. e.g. >>> synchronous_transfer = all | commit (default). > >> Since commits are more foreground in nature and this feature >> does not require us to wait during common foreground activities, we want a >> configuration where master can wait for synchronous transfers at other than >> commits. May we can solve that by having more granular control to the said >> parameter ? >> >>> >>> The idea of another slew of parameters that are very similar to >>> synchronous replication but yet somehow different seems weird. I can't >>> see a reason why we'd want a second lot of parameters. Why not just >>> use the existing ones for sync rep? (I'm surprised the Parameter >>> Police haven't visited you in the night...) Sure, we might want to >>> expand the design for how we specify multi-node sync rep, but that is >>> a different patch. >> >> >> How would we then distinguish between synchronous and the new kind of >> standby ? > > That's not the point. The point is "Why would we have a new kind of > standby?" and therefore why do we need new parameters? > >> I am told, one of the very popular setups for DR is to have one >> local sync standby and one async (may be cascaded by the local sync). Since >> this new feature is more useful for DR because taking a fresh backup on a >> slower link is even more challenging, IMHO we should support such setups. > > ...which still doesn't make sense to me. Lets look at that in detail. > > Take 3 servers, A, B, C with A and B being linked by sync rep, and C > being safety standby at a distance. > > Either A or B is master, except in disaster. So if A is master, then B > would be the failover target. If A fails, then you want to failover to > B. Once B is the target, you want to failback to A as the master. C > needs to follow the new master, whichever it is. > > If you set up sync rep between A and B and this new mode between A and > C. When B becomes the master, you need to failback from B from A, but > you can't because the new mode applied between A and C only, so you > have to failback from C to A. So having the new mode not match with > sync rep means you are forcing people to failback using the slow link > in the common case. > > You might observe that having the two modes match causes problems if A > and B fail, so you are forced to go to C as master and then eventually > failback to A or B across a slow link. That case is less common and > could be solved by extending sync transfer to more/multi nodes. > > It definitely doesn't make sense to have sync rep on anything other > than a subset of sync transfer. So while it may be sensible in the > future to make sync transfer a superset of sync rep nodes, it makes > sense to make them the same config for now. > when 2 servers being synchronous replication, those servers are in same location in many cases. ( e.g., same server room) so taking a full backup and sending it to old master is not issue. 
This proposal works for the situation where those servers are in remote locations and the main site goes down, for example because of a power failure or a natural disaster.

As you said, we can control file (e.g., CLOG, pg_control, etc.) replication by adding a synchronous_transfer option. But if we add only this parameter, we can handle only the following 2 cases:
1. synchronous standby which is also the failback safe standby
2. asynchronous standby which is also the failback safe standby

In the above cases, adding a new parameter might be meaningless. But I think that we should handle not only cases 1 and 2 but also the following cases 3 and 4 for DR:
3. synchronous standby with a separate asynchronous failback safe standby
4. asynchronous standby with a separate asynchronous failback safe standby

To handle cases 3 and 4, we should be able to set the parameter for each standby, so we need to add a new parameter. If we can structure replication like this, it would be more useful for users on a slow link.

The parameter improvement idea is to extend an ini-style file so that parameters can be set per standby. For example:
--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------
There have been discussions about such an ini file in the past. With it, we could set each parameter for each standby.

Please give me feedback.

Regards,

-------
Sawada Masahiko
Hi,

> The parameter improvement idea is to extend an ini-style file so that
> parameters can be set per standby. For example:
>
> --------------------
> [Server]
> standby_name = 'slave1'
> synchronous_transfer = commit
> wal_sender_timeout = 30
> [Server]
> standby_name = 'slave2'
> synchronous_transfer = all
> wal_sender_timeout = 50
> -------------------

Just asking to clarify: Is 'slave2' a failback standby? What does 'synchronous_transfer = all' mean? Does that mean waiting during both commit and checkpoint?

--
Amit Langote
Hi,

>> So our proposal on this problem is that we must ensure that master should
>> not make any file system level changes without confirming that the
>> corresponding WAL record is replicated to the standby.
>
> How will you take care of extra WAL on old master during recovery? If it
> plays the WAL which has not reached new-master, it can be a problem.

I am trying to understand how there would be extra WAL on the old master that it would replay and cause inconsistency. Consider how I am picturing it and correct me if I am wrong.

1) Master crashes, so a failback standby becomes the new master, forking the WAL.
2) Old master is restarted as a standby (now with this patch, without a new base backup).
3) It would try to replay all the WAL it has available and later connect to the new master, also following the timeline switch (the switch might happen using archived WAL and a timeline history file OR the new switch-over-streaming-replication connection as of 9.3, right?).

* In (3), when the new standby/old master is replaying WAL, from where is it picking the WAL? Does it first replay all the WAL in pg_xlog before the archive? Should we make it check for a timeline history file in the archive before it starts replaying any WAL?

* And, would the new master, before forking the WAL, replay all the WAL that is necessary to come to the state (of the data directory) that the old master was in just before it crashed?

Am I missing something here?

--
Amit Langote
On Tue, Jun 25, 2013 at 12:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Mon, Jun 24, 2013 at 7:17 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>
>> --------------------
>> [Server]
>> standby_name = 'slave1'
>> synchronous_transfer = commit
>> wal_sender_timeout = 30
>> [Server]
>> standby_name = 'slave2'
>> synchronous_transfer = all
>> wal_sender_timeout = 50
>> -------------------
>
> What different values/modes are you thinking of for synchronous_transfer? IMHO
> only "commit" and "all" may not be enough. As I suggested upthread, we may
> need an additional mode, say "data", which will ensure synchronous WAL
> transfer before making any file system changes. We need this separate mode
> because the failback safe (or whatever we call it) standby need not wait on
> the commits and it's important to avoid that wait since it comes in a direct
> path of client transactions.
>
> If we are doing it, I wonder if an additional mode "none" also makes sense
> so that users can also control asynchronous standbys via the same mechanism.

I made a mistake with the parameter names, mixing up synchronous_transfer and failback_safe_standby_mode. What I mean is that we control file system changes using failback_safe_standby_mode. If failback_safe_standby_mode is set to 'remote_flush', the master server waits for the standby before flushing data pages (e.g., CLOG, pg_control). Right?

For example:
------
[server]
standby_name = 'slave1'
failback_safe_standby_mode = remote_flush
wal_sender_timeout = 50
------

In this case, we should also set synchronous_commit and synchronous_level for each standby server. That is, do we need to set the following 3 parameters to support cases 3 and 4 as I said?

- synchronous_commit = on/off/local/remote_write
- failback_safe_standby_mode = off/remote_write/remote_flush
- synchronous_level = sync/async
(this parameter means whether the standby server is connected in sync or async mode)

Please give me your feedback.

Regards,

-------
Sawada Masahiko
On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote: > Hi, > > > > >> So our proposal on this problem is that we must ensure that master > should > > not make any file system level changes without confirming that the > >> corresponding WAL record is replicated to the standby. > > > > How will you take care of extra WAL on old master during recovery. > If it > > plays the WAL which has not reached new-master, it can be a problem. > > > > I am trying to understand how there would be extra WAL on old master > that it would replay and cause inconsistency. Consider how I am > picturing it and correct me if I am wrong. > > 1) Master crashes. So a failback standby becomes new master forking the > WAL. > 2) Old master is restarted as a standby (now with this patch, without > a new base backup). > 3) It would try to replay all the WAL it has available and later > connect to the new master also following the timeline switch (the > switch might happen using archived WAL and timeline history file OR > the new switch-over-streaming-replication-connection as of 9.3, > right?) > > * in (3), when the new standby/old master is replaying WAL, from where > is it picking the WAL? Yes, this is the point which can lead to inconsistency, new standby/old master will replay WALafter the last successful checkpoint, for which he get info from control file. It is picking WAL from the location whereit was logged when it was active (pg_xlog). > Does it first replay all the WAL in pg_xlog > before archive? Should we make it check for a timeline history file in > archive before it starts replaying any WAL? I have really not thought what is best solution for problem. > * And, would the new master, before forking the WAL, replay all the > WAL that is necessary to come to state (of data directory) that the > old master was just before it crashed? I don't think new master has any correlation with old master's data directory, Rather it will replay the WAL it has received/flushed before start acting as master. With Regards, Amit Kapila.
On Mon, Jun 24, 2013 at 10:47 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> 1. synchronous standby and make same as failback safe standby
> 2. asynchronous standby and make same as failback safe standby
>
> in above case, adding new parameter might be meaningless. but I think
> that we should handle case not only case 1,2 but also following case
> 3, 4 for DR.

To support cases 1 and 2, I'm thinking of the following 2 other ideas.

------------------------
We add synchronous_transfer (commit / data_flush / all). This GUC will only affect the standbys mentioned in the list of synchronous_standby_names.
1. If synchronous_transfer is set to commit, the current synchronous replication behavior is achieved.
2. If synchronous_transfer is set to data_flush, the standbys named in synchronous_standby_names will act as ASYNC failback safe standbys.
3. If synchronous_transfer is set to all, the standbys named in synchronous_standby_names will act as SYNC failback safe standbys.
In this approach, 3 is confusing because we are actually setting up an ASYNC standby by using the GUCs meant for sync standby setup.
-------------------------
We extend synchronous_commit so that it also accepts a value like 'all'. (This approach doesn't provide a 'synchronous_transfer' parameter.) The 'all' value means that the master waits not only for replicated WAL but also for replicated data pages (e.g., CLOG, pg_control), and the master changes its behavior depending on whether the standby is connected as sync or async.
1. If synchronous_commit is set to 'all' and synchronous_standby_names is set to the standby name, the standbys named in synchronous_standby_names will act as SYNC failback safe standbys.
2. If synchronous_commit is set to 'all' and synchronous_standby_names is NOT set to the standby name, the standbys which are connecting to the master will act as ASYNC failback safe standbys.
One problem with not naming an ASYNC standby explicitly is that the master has no clue which standby to wait on. If it chooses to wait on all async standbys for failback-safety that can be quite detrimental, especially because async standbys can easily become unreachable if they are on a slow link or at a remote location.

Please give me feedback.

Regards,

-------
Sawada Masahiko
On Mon, Jun 17, 2013 at 7:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I am told, one of the very popular setups for DR is to have one >> local sync standby and one async (may be cascaded by the local sync). Since >> this new feature is more useful for DR because taking a fresh backup on a >> slower link is even more challenging, IMHO we should support such setups. > > ...which still doesn't make sense to me. Lets look at that in detail. > > Take 3 servers, A, B, C with A and B being linked by sync rep, and C > being safety standby at a distance. > > Either A or B is master, except in disaster. So if A is master, then B > would be the failover target. If A fails, then you want to failover to > B. Once B is the target, you want to failback to A as the master. C > needs to follow the new master, whichever it is. > > If you set up sync rep between A and B and this new mode between A and > C. When B becomes the master, you need to failback from B from A, but > you can't because the new mode applied between A and C only, so you > have to failback from C to A. So having the new mode not match with > sync rep means you are forcing people to failback using the slow link > in the common case. It's true that in this scenario that doesn't really make sense, but I still think they are separate properties. You could certainly want synchronous replication without this new property, if you like the data-loss guarantees that sync rep provides but don't care about failback. You could also want this new property without synchronous replication, if you don't need the data-loss guarantees that sync rep provides but you do care about fast failback. I admit it seems unlikely that you would use both features but not target them at the same machines, although maybe: perhaps you have a sync standby and an async standby and want this new property with respect to both of them. In my admittedly limited experience, the use case for a lot of this technology is in the cloud. The general strategy seems to be: at the first sign of trouble, kill the offending instance and fail over. This can result in failing over pretty frequently, and needing it to be fast. There may be no real hardware problem; indeed, the failover may be precipitated by network conditions or overload of the physical host backing the virtual machine or any number of other nonphysical problems. I can see this being useful in that environment, even for async standbys. People can apparently tolerate a brief interruption while their primary gets killed off and connections are re-established with the new master, but they need the failover to be fast. The problem with the status quo is that, even if the first failover is fast, the second one isn't, because it has to wait behind rebuilding the original master. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote: >> Hi, >> >> > >> >> So our proposal on this problem is that we must ensure that master >> should >> > not make any file system level changes without confirming that the >> >> corresponding WAL record is replicated to the standby. >> > >> > How will you take care of extra WAL on old master during recovery. >> If it >> > plays the WAL which has not reached new-master, it can be a problem. >> > >> >> I am trying to understand how there would be extra WAL on old master >> that it would replay and cause inconsistency. Consider how I am >> picturing it and correct me if I am wrong. >> >> 1) Master crashes. So a failback standby becomes new master forking the >> WAL. >> 2) Old master is restarted as a standby (now with this patch, without >> a new base backup). >> 3) It would try to replay all the WAL it has available and later >> connect to the new master also following the timeline switch (the >> switch might happen using archived WAL and timeline history file OR >> the new switch-over-streaming-replication-connection as of 9.3, >> right?) >> >> * in (3), when the new standby/old master is replaying WAL, from where >> is it picking the WAL? > Yes, this is the point which can lead to inconsistency, new standby/old master > will replay WAL after the last successful checkpoint, for which he get info from > control file. It is picking WAL from the location where it was logged when it was active (pg_xlog). > >> Does it first replay all the WAL in pg_xlog >> before archive? Should we make it check for a timeline history file in >> archive before it starts replaying any WAL? > > I have really not thought what is best solution for problem. > >> * And, would the new master, before forking the WAL, replay all the >> WAL that is necessary to come to state (of data directory) that the >> old master was just before it crashed? > > I don't think new master has any correlation with old master's data directory, > Rather it will replay the WAL it has received/flushed before start acting as master. when old master fail over, WAL which ahead of new master might be broken data. so that when user want to dump from old master, there is possible to fail dump. it is just idea, we extend parameter which is used in recovery.conf like 'follow_master_force'. this parameter accepts 'on' and 'off', is effective only when standby_mode is set to on. if both parameters 'follow_master_force' and 'standby_mode' is set to 'on', 1. when standby server starts and starts to recovery, standby server skip to apply WAL which is in pg_xlog, and request WAL from latest checkpoint LSN to master server. 2. master server receives LSN which is standby server latest checkpoint, and compare between LSN of standby and LSN of master latest checkpoint. if those LSN match, master will send WAL from latest checkpoint LSN. if not, master will inform standby that failed. 3. standby will fork WAL, and apply WAL which is sent from master continuity. in this approach, user who want to dump from old master will set 'off' to follow_master_force and standby_mode, and gets the dump of old master after master started. OTOH, user who want to starts replication force will set 'on' to both parameter. please give me feedback. Regards, ------- Sawada Masahiko
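Purely as a sketch of the handshake proposed in the mail above (a speculative idea, and none of these names exist in PostgreSQL today), the master-side check of the reported checkpoint LSN might look like the following toy model:
------------------------
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Master-side decision for the hypothetical 'follow_master_force' handshake:
 * the rejoining standby reports its latest-checkpoint LSN, and the master
 * either streams WAL from that point or rejects the request. */
static bool
DecideForceFollowStart(XLogRecPtr standby_ckpt_lsn,
                       XLogRecPtr master_ckpt_lsn,
                       XLogRecPtr *start_from)
{
    if (standby_ckpt_lsn == master_ckpt_lsn)
    {
        /* Checkpoints agree: send WAL from the common checkpoint so the old
         * master skips its own (possibly divergent) local WAL and forks onto
         * the new master's timeline from here. */
        *start_from = master_ckpt_lsn;
        return true;
    }

    /* Checkpoints differ: tell the standby the forced follow failed rather
     * than let it replay local WAL the new master never received. */
    return false;
}

int main(void)
{
    XLogRecPtr start;
    printf("matching checkpoints -> %d\n", DecideForceFollowStart(100, 100, &start));
    printf("diverged checkpoints -> %d\n", DecideForceFollowStart(100, 96, &start));
    return 0;
}
------------------------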
On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote: > On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> > wrote: > > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote: > >> Hi, > >> > >> > > >> >> So our proposal on this problem is that we must ensure that > master > >> should > >> > not make any file system level changes without confirming that the > >> >> corresponding WAL record is replicated to the standby. > >> > > >> > How will you take care of extra WAL on old master during > recovery. > >> If it > >> > plays the WAL which has not reached new-master, it can be a > problem. > >> > > >> > >> I am trying to understand how there would be extra WAL on old master > >> that it would replay and cause inconsistency. Consider how I am > >> picturing it and correct me if I am wrong. > >> > >> 1) Master crashes. So a failback standby becomes new master forking > the > >> WAL. > >> 2) Old master is restarted as a standby (now with this patch, > without > >> a new base backup). > >> 3) It would try to replay all the WAL it has available and later > >> connect to the new master also following the timeline switch (the > >> switch might happen using archived WAL and timeline history file OR > >> the new switch-over-streaming-replication-connection as of 9.3, > >> right?) > >> > >> * in (3), when the new standby/old master is replaying WAL, from > where > >> is it picking the WAL? > > Yes, this is the point which can lead to inconsistency, new > standby/old master > > will replay WAL after the last successful checkpoint, for which he > get info from > > control file. It is picking WAL from the location where it was > logged when it was active (pg_xlog). > > > >> Does it first replay all the WAL in pg_xlog > >> before archive? Should we make it check for a timeline history file > in > >> archive before it starts replaying any WAL? > > > > I have really not thought what is best solution for problem. > > > >> * And, would the new master, before forking the WAL, replay all the > >> WAL that is necessary to come to state (of data directory) that the > >> old master was just before it crashed? > > > > I don't think new master has any correlation with old master's data > directory, > > Rather it will replay the WAL it has received/flushed before start > acting as master. > when old master fail over, WAL which ahead of new master might be > broken data. so that when user want to dump from old master, there is > possible to fail dump. > it is just idea, we extend parameter which is used in recovery.conf > like 'follow_master_force'. this parameter accepts 'on' and 'off', is > effective only when standby_mode is set to on. > > if both parameters 'follow_master_force' and 'standby_mode' is set to > 'on', > 1. when standby server starts and starts to recovery, standby server > skip to apply WAL which is in pg_xlog, and request WAL from latest > checkpoint LSN to master server. > 2. master server receives LSN which is standby server latest > checkpoint, and compare between LSN of standby and LSN of master > latest checkpoint. if those LSN match, master will send WAL from > latest checkpoint LSN. if not, master will inform standby that failed. > 3. standby will fork WAL, and apply WAL which is sent from master > continuity. 
Please consider if this solution has the same problem as mentioned by Robert Hass in below mail: http://www.postgresql.org/message-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com > in this approach, user who want to dump from old master will set 'off' > to follow_master_force and standby_mode, and gets the dump of old > master after master started. OTOH, user who want to starts replication > force will set 'on' to both parameter. I think before going into solution of this problem, it should be confirmed by others whether such a problem needs to be resolved as part of this patch. I have seen that Simon Riggs is a reviewer of this Patch and he hasn't mentioned his views about this problem. So I think it's not worth inventing a solution. Rather I think if all other things are resolved for this patch, then may be in end we can check with Committer, if he thinks that this problem needs to be solved as a separate patch. With Regards, Amit Kapila.
On Tue, Jul 2, 2013 at 2:45 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote: >> On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> >> wrote: >> > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote: >> >> Hi, >> >> >> >> > >> >> >> So our proposal on this problem is that we must ensure that >> master >> >> should >> >> > not make any file system level changes without confirming that the >> >> >> corresponding WAL record is replicated to the standby. >> >> > >> >> > How will you take care of extra WAL on old master during >> recovery. >> >> If it >> >> > plays the WAL which has not reached new-master, it can be a >> problem. >> >> > >> >> >> >> I am trying to understand how there would be extra WAL on old master >> >> that it would replay and cause inconsistency. Consider how I am >> >> picturing it and correct me if I am wrong. >> >> >> >> 1) Master crashes. So a failback standby becomes new master forking >> the >> >> WAL. >> >> 2) Old master is restarted as a standby (now with this patch, >> without >> >> a new base backup). >> >> 3) It would try to replay all the WAL it has available and later >> >> connect to the new master also following the timeline switch (the >> >> switch might happen using archived WAL and timeline history file OR >> >> the new switch-over-streaming-replication-connection as of 9.3, >> >> right?) >> >> >> >> * in (3), when the new standby/old master is replaying WAL, from >> where >> >> is it picking the WAL? >> > Yes, this is the point which can lead to inconsistency, new >> standby/old master >> > will replay WAL after the last successful checkpoint, for which he >> get info from >> > control file. It is picking WAL from the location where it was >> logged when it was active (pg_xlog). >> > >> >> Does it first replay all the WAL in pg_xlog >> >> before archive? Should we make it check for a timeline history file >> in >> >> archive before it starts replaying any WAL? >> > >> > I have really not thought what is best solution for problem. >> > >> >> * And, would the new master, before forking the WAL, replay all the >> >> WAL that is necessary to come to state (of data directory) that the >> >> old master was just before it crashed? >> > >> > I don't think new master has any correlation with old master's data >> directory, >> > Rather it will replay the WAL it has received/flushed before start >> acting as master. >> when old master fail over, WAL which ahead of new master might be >> broken data. so that when user want to dump from old master, there is >> possible to fail dump. >> it is just idea, we extend parameter which is used in recovery.conf >> like 'follow_master_force'. this parameter accepts 'on' and 'off', is >> effective only when standby_mode is set to on. >> >> if both parameters 'follow_master_force' and 'standby_mode' is set to >> 'on', >> 1. when standby server starts and starts to recovery, standby server >> skip to apply WAL which is in pg_xlog, and request WAL from latest >> checkpoint LSN to master server. >> 2. master server receives LSN which is standby server latest >> checkpoint, and compare between LSN of standby and LSN of master >> latest checkpoint. if those LSN match, master will send WAL from >> latest checkpoint LSN. if not, master will inform standby that failed. >> 3. standby will fork WAL, and apply WAL which is sent from master >> continuity. 
> > Please consider if this solution has the same problem as mentioned by Robert Hass in below mail: > http://www.postgresql.org/message-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com > > >> in this approach, user who want to dump from old master will set 'off' >> to follow_master_force and standby_mode, and gets the dump of old >> master after master started. OTOH, user who want to starts replication >> force will set 'on' to both parameter. > > I think before going into solution of this problem, it should be confirmed by others whether such a problem > needs to be resolved as part of this patch. > > I have seen that Simon Riggs is a reviewer of this Patch and he hasn't mentioned his views about this problem. > So I think it's not worth inventing a solution. > Rather I think if all other things are resolved for this patch, then may be in end we can check with Committer, > if he thinks that this problem needs to be solved as a separate patch. thank you for feedback. yes, we can consider separately those problem. and we need to judge that whether it is worth to invent a solution. I think that solving the fundamental of this problem is complex. it might be needs to big change to architecture of replication. so I'm thinking that I'd like to deal of something when we do recovery. if so, I think that if we deal at recovery time, impact to performance is ignored. Regards, ------- Sawada Masahiko
On Tuesday, July 02, 2013 11:16 AM Amit Kapila wrote: > On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote: > > On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> > > wrote: > > > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote: > > >> Hi, > > >> > > >> > > > >> >> So our proposal on this problem is that we must ensure that > > master > > >> should > > >> > not make any file system level changes without confirming that > the > > >> >> corresponding WAL record is replicated to the standby. > > >> > > > >> > How will you take care of extra WAL on old master during > > recovery. > > >> If it > > >> > plays the WAL which has not reached new-master, it can be a > > problem. > > >> > > > >> > > >> I am trying to understand how there would be extra WAL on old > master > > >> that it would replay and cause inconsistency. Consider how I am > > >> picturing it and correct me if I am wrong. > > >> > > >> 1) Master crashes. So a failback standby becomes new master > forking > > the > > >> WAL. > > >> 2) Old master is restarted as a standby (now with this patch, > > without > > >> a new base backup). > > >> 3) It would try to replay all the WAL it has available and later > > >> connect to the new master also following the timeline switch (the > > >> switch might happen using archived WAL and timeline history file > OR > > >> the new switch-over-streaming-replication-connection as of 9.3, > > >> right?) > > >> > > >> * in (3), when the new standby/old master is replaying WAL, from > > where > > >> is it picking the WAL? > > > Yes, this is the point which can lead to inconsistency, new > > standby/old master > > > will replay WAL after the last successful checkpoint, for which > he > > get info from > > > control file. It is picking WAL from the location where it was > > logged when it was active (pg_xlog). > > > > > >> Does it first replay all the WAL in pg_xlog > > >> before archive? Should we make it check for a timeline history > file > > in > > >> archive before it starts replaying any WAL? > > > > > > I have really not thought what is best solution for problem. > > > > > >> * And, would the new master, before forking the WAL, replay all > the > > >> WAL that is necessary to come to state (of data directory) that > the > > >> old master was just before it crashed? > > > > > > I don't think new master has any correlation with old master's data > > directory, > > > Rather it will replay the WAL it has received/flushed before start > > acting as master. > > when old master fail over, WAL which ahead of new master might be > > broken data. so that when user want to dump from old master, there is > > possible to fail dump. > > it is just idea, we extend parameter which is used in recovery.conf > > like 'follow_master_force'. this parameter accepts 'on' and 'off', is > > effective only when standby_mode is set to on. > > > > if both parameters 'follow_master_force' and 'standby_mode' is set to > > 'on', > > 1. when standby server starts and starts to recovery, standby server > > skip to apply WAL which is in pg_xlog, and request WAL from latest > > checkpoint LSN to master server. > > 2. master server receives LSN which is standby server latest > > checkpoint, and compare between LSN of standby and LSN of master > > latest checkpoint. if those LSN match, master will send WAL from > > latest checkpoint LSN. if not, master will inform standby that > failed. > > 3. standby will fork WAL, and apply WAL which is sent from master > > continuity. 
> > Please consider if this solution has the same problem as mentioned by > Robert Hass in below mail: Sorry typo error, it's Robert Haas mail: > http://www.postgresql.org/message- > id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com > > > > in this approach, user who want to dump from old master will set > 'off' > > to follow_master_force and standby_mode, and gets the dump of old > > master after master started. OTOH, user who want to starts > replication > > force will set 'on' to both parameter. > > I think before going into solution of this problem, it should be > confirmed by others whether such a problem > needs to be resolved as part of this patch. > > I have seen that Simon Riggs is a reviewer of this Patch and he hasn't > mentioned his views about this problem. > So I think it's not worth inventing a solution. > > Rather I think if all other things are resolved for this patch, then > may be in end we can check with Committer, > if he thinks that this problem needs to be solved as a separate patch.
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > >> I agree. We should probably find a better name for this. Any suggestions ? > > err, I already made one... > >>> But that's not the whole story. I can see some utility in a patch that >>> makes all WAL transfer synchronous, rather than just commits. Some >>> name like synchronous_transfer might be appropriate. e.g. >>> synchronous_transfer = all | commit (default). > >> Since commits are more foreground in nature and this feature >> does not require us to wait during common foreground activities, we want a >> configuration where master can wait for synchronous transfers at other than >> commits. May we can solve that by having more granular control to the said >> parameter ? >> >>> >>> The idea of another slew of parameters that are very similar to >>> synchronous replication but yet somehow different seems weird. I can't >>> see a reason why we'd want a second lot of parameters. Why not just >>> use the existing ones for sync rep? (I'm surprised the Parameter >>> Police haven't visited you in the night...) Sure, we might want to >>> expand the design for how we specify multi-node sync rep, but that is >>> a different patch. >> >> >> How would we then distinguish between synchronous and the new kind of >> standby ? > > That's not the point. The point is "Why would we have a new kind of > standby?" and therefore why do we need new parameters? > >> I am told, one of the very popular setups for DR is to have one >> local sync standby and one async (may be cascaded by the local sync). Since >> this new feature is more useful for DR because taking a fresh backup on a >> slower link is even more challenging, IMHO we should support such setups. > > ...which still doesn't make sense to me. Lets look at that in detail. > > Take 3 servers, A, B, C with A and B being linked by sync rep, and C > being safety standby at a distance. > > Either A or B is master, except in disaster. So if A is master, then B > would be the failover target. If A fails, then you want to failover to > B. Once B is the target, you want to failback to A as the master. C > needs to follow the new master, whichever it is. > > If you set up sync rep between A and B and this new mode between A and > C. When B becomes the master, you need to failback from B from A, but > you can't because the new mode applied between A and C only, so you > have to failback from C to A. So having the new mode not match with > sync rep means you are forcing people to failback using the slow link > in the common case. > > You might observe that having the two modes match causes problems if A > and B fail, so you are forced to go to C as master and then eventually > failback to A or B across a slow link. That case is less common and > could be solved by extending sync transfer to more/multi nodes. > > It definitely doesn't make sense to have sync rep on anything other > than a subset of sync transfer. So while it may be sensible in the > future to make sync transfer a superset of sync rep nodes, it makes > sense to make them the same config for now. I have updated the patch. we support following 2 cases. 1. SYNC server and also make same failback safe standby server 2. ASYNC server and also make same failback safe standby server 1. changed name of parameter give up 'failback_safe_standby_names' parameter from the first patch. 
and changed the name of the parameter from 'failback_safe_mode' to 'synchronous_transfer'. This parameter accepts 'all', 'data_flush' and 'commit'.

- 'commit'
'commit' means that the master waits for the corresponding WAL to be flushed to disk on the standby server on commits, but the master doesn't wait for replicated data pages.

- 'data_flush'
'data_flush' means that the master waits for the replicated data page (e.g., CLOG, pg_control) before flushing it to disk on the master server. If the user sets this parameter to 'data_flush', the 'synchronous_commit' value is ignored even if the user has set 'synchronous_commit'.

- 'all'
'all' means that the master waits for both replicated WAL and data pages.

2. put SyncRepWaitForLSN() function into XLogFlush() function
We have put the SyncRepWaitForLSN() function into the XLogFlush() function, and changed the argument of XLogFlush().

These are the setup cases and the parameters they need:

- SYNC server which is also the failback safe standby server (case 1)
synchronous_transfer = all
synchronous_commit = remote_write/on
synchronous_standby_names = <ServerName>

- ASYNC server which is also the failback safe standby server (case 2)
synchronous_transfer = data_flush
(synchronous_commit values are ignored)

- default SYNC replication
synchronous_transfer = commit
synchronous_commit = on
synchronous_standby_names = <ServerName>

- default ASYNC replication
synchronous_transfer = commit

ToDo
1. Currently this patch supports synchronous transfer, so we can't set a different synchronous transfer mode for each server. We need to improve the patch to support the following cases:
- SYNC standby and a separate ASYNC failback safe standby
- ASYNC standby and a separate ASYNC failback safe standby
2. We have not measured performance yet. We need to measure performance.

Please give me your feedback.

Regards,

-------
Sawada Masahiko
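As a rough illustration of the three modes just described (the enum and helper names below are invented for the sketch, not the patch's identifiers), the question "does this operation wait for the standby?" could be captured like this:
------------------------
#include <stdbool.h>
#include <stdio.h>

/* Mirrors the described synchronous_transfer values: commit = plain sync rep,
 * data_flush = wait only before data-page writes (commits don't wait),
 * all = wait at commit and before data-page writes. */
typedef enum
{
    SYNC_TRANSFER_COMMIT,
    SYNC_TRANSFER_DATA_FLUSH,
    SYNC_TRANSFER_ALL
} SyncTransferMode;

static SyncTransferMode synchronous_transfer = SYNC_TRANSFER_ALL;

/* Must this operation wait for the standby to flush the WAL first? */
static bool
SyncTransferRequired(bool is_commit)
{
    switch (synchronous_transfer)
    {
        case SYNC_TRANSFER_ALL:
            return true;
        case SYNC_TRANSFER_DATA_FLUSH:
            return !is_commit;
        case SYNC_TRANSFER_COMMIT:
        default:
            return is_commit;
    }
}

int main(void)
{
    printf("commit waits: %d, data-page flush waits: %d\n",
           SyncTransferRequired(true), SyncTransferRequired(false));
    return 0;
}
------------------------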
On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: >> >>> I agree. We should probably find a better name for this. Any suggestions ? >> >> err, I already made one... >> >>>> But that's not the whole story. I can see some utility in a patch that >>>> makes all WAL transfer synchronous, rather than just commits. Some >>>> name like synchronous_transfer might be appropriate. e.g. >>>> synchronous_transfer = all | commit (default). >> >>> Since commits are more foreground in nature and this feature >>> does not require us to wait during common foreground activities, we want a >>> configuration where master can wait for synchronous transfers at other than >>> commits. May we can solve that by having more granular control to the said >>> parameter ? >>> >>>> >>>> The idea of another slew of parameters that are very similar to >>>> synchronous replication but yet somehow different seems weird. I can't >>>> see a reason why we'd want a second lot of parameters. Why not just >>>> use the existing ones for sync rep? (I'm surprised the Parameter >>>> Police haven't visited you in the night...) Sure, we might want to >>>> expand the design for how we specify multi-node sync rep, but that is >>>> a different patch. >>> >>> >>> How would we then distinguish between synchronous and the new kind of >>> standby ? >> >> That's not the point. The point is "Why would we have a new kind of >> standby?" and therefore why do we need new parameters? >> >>> I am told, one of the very popular setups for DR is to have one >>> local sync standby and one async (may be cascaded by the local sync). Since >>> this new feature is more useful for DR because taking a fresh backup on a >>> slower link is even more challenging, IMHO we should support such setups. >> >> ...which still doesn't make sense to me. Lets look at that in detail. >> >> Take 3 servers, A, B, C with A and B being linked by sync rep, and C >> being safety standby at a distance. >> >> Either A or B is master, except in disaster. So if A is master, then B >> would be the failover target. If A fails, then you want to failover to >> B. Once B is the target, you want to failback to A as the master. C >> needs to follow the new master, whichever it is. >> >> If you set up sync rep between A and B and this new mode between A and >> C. When B becomes the master, you need to failback from B from A, but >> you can't because the new mode applied between A and C only, so you >> have to failback from C to A. So having the new mode not match with >> sync rep means you are forcing people to failback using the slow link >> in the common case. >> >> You might observe that having the two modes match causes problems if A >> and B fail, so you are forced to go to C as master and then eventually >> failback to A or B across a slow link. That case is less common and >> could be solved by extending sync transfer to more/multi nodes. >> >> It definitely doesn't make sense to have sync rep on anything other >> than a subset of sync transfer. So while it may be sensible in the >> future to make sync transfer a superset of sync rep nodes, it makes >> sense to make them the same config for now. > I have updated the patch. > > we support following 2 cases. > 1. SYNC server and also make same failback safe standby server > 2. ASYNC server and also make same failback safe standby server > > 1. 
changed name of parameter > give up 'failback_safe_standby_names' parameter from the first patch. > and changed name of parameter from 'failback_safe_mode ' to > 'synchronous_transfer'. > this parameter accepts 'all', 'data_flush' and 'commit'. > > -'commit' > 'commit' means that master waits for corresponding WAL to flushed > to disk of standby server on commits. > but master doesn't waits for replicated data pages. > > -'data_flush' > 'data_flush' means that master waits for replicated data page > (e.g, CLOG, pg_control) before flush to disk of master server. > but if user set to 'data_flush' to this parameter, > 'synchronous_commit' values is ignored even if user set > 'synchronous_commit'. > > -'all' > 'all' means that master waits for replicated WAL and data page. > > 2. put SyncRepWaitForLSN() function into XLogFlush() function > we have put SyncRepWaitForLSN() function into XLogFlush() function, > and change argument of XLogFlush(). > > they are setup case and need to set parameters. > > - SYNC server and also make same failback safe standgy server (case 1) > synchronous_transfer = all > synchronous_commit = remote_write/on > synchronous_standby_names = <ServerName> > > - ASYNC server and also make same failback safe standgy server (case 2) > synchronous_transfer = data_flush > (synchronous_commit values is ignored) > > - default SYNC replication > synchronous_transfer = commit > synchronous_commit = on > synchronous_standby_names = <ServerName> > > - default ASYNC replication > synchronous_transfer = commit > > ToDo > 1. currently this patch supports synchronous transfer. so we can't set > different synchronous transfer mode to each server. > we need to improve the patch for support following cases. > - SYNC standby and make separate ASYNC failback safe standby > - ASYNC standby and make separate ASYNC failback safe standby > > 2. we have not measure performance yet. we need to measure perfomance. > > please give me your feedback. > > Regards, > > ------- > Sawada Masahiko I'm sorry. I forgot attached the patch. Please see the attached file. Regards, ------- Sawada Masahiko
Attachment
On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: >>> >>>> I agree. We should probably find a better name for this. Any suggestions ? >>> >>> err, I already made one... >>> >>>>> But that's not the whole story. I can see some utility in a patch that >>>>> makes all WAL transfer synchronous, rather than just commits. Some >>>>> name like synchronous_transfer might be appropriate. e.g. >>>>> synchronous_transfer = all | commit (default). >>> >>>> Since commits are more foreground in nature and this feature >>>> does not require us to wait during common foreground activities, we want a >>>> configuration where master can wait for synchronous transfers at other than >>>> commits. May we can solve that by having more granular control to the said >>>> parameter ? >>>> >>>>> >>>>> The idea of another slew of parameters that are very similar to >>>>> synchronous replication but yet somehow different seems weird. I can't >>>>> see a reason why we'd want a second lot of parameters. Why not just >>>>> use the existing ones for sync rep? (I'm surprised the Parameter >>>>> Police haven't visited you in the night...) Sure, we might want to >>>>> expand the design for how we specify multi-node sync rep, but that is >>>>> a different patch. >>>> >>>> >>>> How would we then distinguish between synchronous and the new kind of >>>> standby ? >>> >>> That's not the point. The point is "Why would we have a new kind of >>> standby?" and therefore why do we need new parameters? >>> >>>> I am told, one of the very popular setups for DR is to have one >>>> local sync standby and one async (may be cascaded by the local sync). Since >>>> this new feature is more useful for DR because taking a fresh backup on a >>>> slower link is even more challenging, IMHO we should support such setups. >>> >>> ...which still doesn't make sense to me. Lets look at that in detail. >>> >>> Take 3 servers, A, B, C with A and B being linked by sync rep, and C >>> being safety standby at a distance. >>> >>> Either A or B is master, except in disaster. So if A is master, then B >>> would be the failover target. If A fails, then you want to failover to >>> B. Once B is the target, you want to failback to A as the master. C >>> needs to follow the new master, whichever it is. >>> >>> If you set up sync rep between A and B and this new mode between A and >>> C. When B becomes the master, you need to failback from B from A, but >>> you can't because the new mode applied between A and C only, so you >>> have to failback from C to A. So having the new mode not match with >>> sync rep means you are forcing people to failback using the slow link >>> in the common case. >>> >>> You might observe that having the two modes match causes problems if A >>> and B fail, so you are forced to go to C as master and then eventually >>> failback to A or B across a slow link. That case is less common and >>> could be solved by extending sync transfer to more/multi nodes. >>> >>> It definitely doesn't make sense to have sync rep on anything other >>> than a subset of sync transfer. So while it may be sensible in the >>> future to make sync transfer a superset of sync rep nodes, it makes >>> sense to make them the same config for now. >> I have updated the patch. >> >> we support following 2 cases. 
>> 1. SYNC server and also make same failback safe standby server >> 2. ASYNC server and also make same failback safe standby server >> >> 1. changed name of parameter >> give up 'failback_safe_standby_names' parameter from the first patch. >> and changed name of parameter from 'failback_safe_mode ' to >> 'synchronous_transfer'. >> this parameter accepts 'all', 'data_flush' and 'commit'. >> >> -'commit' >> 'commit' means that master waits for corresponding WAL to flushed >> to disk of standby server on commits. >> but master doesn't waits for replicated data pages. >> >> -'data_flush' >> 'data_flush' means that master waits for replicated data page >> (e.g, CLOG, pg_control) before flush to disk of master server. >> but if user set to 'data_flush' to this parameter, >> 'synchronous_commit' values is ignored even if user set >> 'synchronous_commit'. >> >> -'all' >> 'all' means that master waits for replicated WAL and data page. >> >> 2. put SyncRepWaitForLSN() function into XLogFlush() function >> we have put SyncRepWaitForLSN() function into XLogFlush() function, >> and change argument of XLogFlush(). >> >> they are setup case and need to set parameters. >> >> - SYNC server and also make same failback safe standgy server (case 1) >> synchronous_transfer = all >> synchronous_commit = remote_write/on >> synchronous_standby_names = <ServerName> >> >> - ASYNC server and also make same failback safe standgy server (case 2) >> synchronous_transfer = data_flush >> (synchronous_commit values is ignored) >> >> - default SYNC replication >> synchronous_transfer = commit >> synchronous_commit = on >> synchronous_standby_names = <ServerName> >> >> - default ASYNC replication >> synchronous_transfer = commit >> >> ToDo >> 1. currently this patch supports synchronous transfer. so we can't set >> different synchronous transfer mode to each server. >> we need to improve the patch for support following cases. >> - SYNC standby and make separate ASYNC failback safe standby >> - ASYNC standby and make separate ASYNC failback safe standby >> >> 2. we have not measure performance yet. we need to measure perfomance. >> >> please give me your feedback. >> >> Regards, >> >> ------- >> Sawada Masahiko > > I'm sorry. I forgot attached the patch. > Please see the attached file. > > Regards, > > ------- > Sawada Masahiko I found a bug which occurred when we do vacuum, and have fixed it. yesterday (8th July) "Improve scalability of WAL insertions" patch is committed to HEAD. so v2 patch does not apply to HEAD now. I also have fixed it to be applicable to HEAD. please find the attached patch. Regards, ------- Sawada Masahiko
Attachment
On Tue, Jul 9, 2013 at 11:45 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > I found a bug which occurred when we do vacuum, and have fixed it. > yesterday (8th July) "Improve scalability of WAL insertions" patch is > committed to HEAD. so v2 patch does not apply to HEAD now. > I also have fixed it to be applicable to HEAD. > > please find the attached patch. > > Regards, > > ------- > Sawada Masahiko I have fixed that master server doesn't waits for the WAL to be flushed to disk of standby when master server execute FlushBuffer(), and have attached v4 patch. please find the attached patch. Regards, ------- Sawada Masahiko
Attachment
> ToDo
> 1. currently this patch supports synchronous transfer. so we can't set
> different synchronous transfer mode to each server.
> we need to improve the patch for support following cases.
> - SYNC standby and make separate ASYNC failback safe standby
> - ASYNC standby and make separate ASYNC failback safe standby
>
> 2. we have not measure performance yet. we need to measure perfomance.

Here are the test results showing the performance overhead of the patch (failback_safe_standby_v4.patch). Tests were carried out in two different scenarios:
1. Tests with fast transaction workloads.
2. Tests with large loads.

Test Type-1: Tests with pgbench (*tests with fast transaction workloads*)
Notes:
1. These tests measure the performance overhead caused by the patch for fast transaction workloads.
2. Tests were performed with the pgbench benchmark; the performance measurement factor is the TPS value.
3. Values represent the TPS for 4 runs, and the last value represents the average of all the runs.
Settings for tests:
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: simple
number of clients: 150
number of threads: 1
duration: 1800 s
Analysis of results:
1) Synchronous Replication: (753.06, 748.81, 748.38, 747.21, Avg - 747.2)
2) Synchronous Replication + Failsafe standby (commit): (729.13, 724.33, 713.59, 710.79, Avg - 719.46)
3) Synchronous Replication + Failsafe standby (all): (692.08, 688.08, 711.23, 711.62, Avg - 700.75)
4) Asynchronous Replication: (1008.42, 993.39, 986.80, 1028.46, Avg - 1004.26)
5) Asynchronous Replication + Failsafe standby (commit): (974.49, 978.60, 969.11, 957.18, Avg - 969.84)
6) Asynchronous Replication + Failsafe standby (data_flush): (1011.79, 992.05, 1030.20, 940.50, Avg - 993.63)
In the above results the numbers are very close to each other and show some variation due to noise, so the following is an approximate conclusion about the overhead of the patch:
1. Streaming replication + synchronous_transfer (all, data_flush):
a) On average, synchronous replication combined with synchronous_transfer (all) causes 6.21% performance overhead.
b) On average, asynchronous streaming replication combined with synchronous_transfer (data_flush) causes 1.05% performance overhead.
2. Streaming replication + synchronous_transfer (commit):
a) On average, synchronous replication combined with synchronous_transfer (commit) causes 3.71% performance overhead.
b) On average, asynchronous streaming replication combined with synchronous_transfer (commit) causes 3.42% performance overhead.

Test Type-2: Tests with pgbench -i (*tests with large loads*)
Notes:
1. These tests measure the performance overhead caused by the patch for large loads and index builds.
2. Tests were performed with pgbench -i (initialization of test data, i.e. the time taken for creating the pgbench tables, inserting tuples and building primary keys).
3. The performance measurement factor is the wall clock time for pgbench -i (measured with the time command).
4. Values represent the wall clock time for 4 runs, and the last value represents the average of all the runs.
pgbench settings:
Scale factor: 300 (database size - 4.3873 GB)
Test results:
1) Synchronous Replication: (126.98, 133.83, 127.77, 129.70, Avg - 129.57) (seconds)
2) Synchronous Replication + synchronous_transfer (commit): (132.87, 125.85, 133.91, 134.61, Avg - 131.81) (seconds)
3) Synchronous Replication + synchronous_transfer (all): (133.59, 132.82, 134.20, 135.22, Avg - 133.95) (seconds)
4) Asynchronous Replication: (126.75, 136.95, 130.42, 127.77, Avg - 130.47) (seconds)
5) Asynchronous Replication + synchronous_transfer (commit): (128.13, 133.06, 127.62, 130.70, Avg - 129.87) (seconds)
6) Asynchronous Replication + synchronous_transfer (data_flush): (134.55, 139.90, 144.47, 143.85, Avg - 140.69) (seconds)
In the above results the numbers are very close to each other and show some variation due to noise, so the following is an approximate conclusion about the overhead of the patch:
1. Streaming replication + synchronous_transfer (all, data_flush):
a) On average, synchronous replication combined with synchronous_transfer (all) causes 3.38% performance overhead.
b) On average, asynchronous streaming replication combined with synchronous_transfer (data_flush) causes 7.83% performance overhead.
2. Streaming replication + synchronous_transfer (commit):
a) On average, synchronous replication combined with synchronous_transfer (commit) causes 1.72% performance overhead.
b) On average, asynchronous streaming replication combined with synchronous_transfer (commit) causes (-0.45)% performance overhead.
The test results for both cases (large loads and fast transactions) show variation because of noise, but we can observe that the patch causes approximately 3-4% performance overhead.
Regards,
Samrat Revgade
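For reference, the overhead percentages above appear to follow directly from the reported averages; using only the numbers quoted above (no new measurements):

overhead (TPS tests)  = (baseline TPS - patched TPS) / baseline TPS
                        e.g. (747.2 - 719.46) / 747.2  ~= 3.71 %
overhead (time tests) = (patched time - baseline time) / baseline time
                        e.g. (133.95 - 129.57) / 129.57 ~= 3.38 %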
On Thu, 2013-07-11 at 23:42 +0900, Sawada Masahiko wrote: > please find the attached patch. Please fix these compiler warnings: xlog.c:3117:2: warning: implicit declaration of function ‘SyncRepWaitForLSN’ [-Wimplicit-function-declaration] syncrep.c:414:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable]
On Sat, Aug 24, 2013 at 11:38 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On Thu, 2013-07-11 at 23:42 +0900, Sawada Masahiko wrote: >> please find the attached patch. > > Please fix these compiler warnings: > > xlog.c:3117:2: warning: implicit declaration of function ‘SyncRepWaitForLSN’ [-Wimplicit-function-declaration] > syncrep.c:414:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable] Thank you for your information! We are improving the patch for Commit Fest 2 now. We will fix above compiler warnings as soon as possible and submit the patch -- Regards, ------- Sawada Masahiko
We are improving the patch for Commit Fest 2 now.
We will fix the above compiler warnings as soon as possible and submit the patch.
The attached *synchronous_transfer_v5.patch* implements the review comments from Commit Fest 1 and reduces the performance overhead of synchronous_transfer.
*synchronous_transfer_documentation_v1.patch* adds the failback safe standby mechanism to the PostgreSQL documentation.
Sawada-san worked very hard to get this done; most of the work was done by him. I really appreciate his efforts :)
**A brief description of the suggestions from Commit Fest 1 is as follows:
1) Fail-back-safe standby is not an appropriate name [done - changed it to synchronous_transfer].
2) Remove the extra set of postgresql.conf parameters [done - now there is only one additional postgresql.conf parameter, *synchronous_transfer*, which controls the synchronous nature of WAL transfer].
3) Performance overhead measurement [done - with fast transaction workloads, and with large loads and index builds].
4) Put SyncRepWaitForLSN inside XLogFlush [not the correct way - SyncRepWaitForLSN would then run inside a critical section, and any error inside a critical section (it does network I/O and may also sleep) leads to a server PANIC and restart].
5) Old master's WAL ahead of new master's WAL [we overcome this by deleting all WAL files of the old master; details can be found here: https://wiki.postgresql.org/wiki/Synchronous_Transfer].
**Changes to postgresql.conf to configure a failback safe standby:
1) Synchronous fail-back safe standby
synchronous_standby_names = <server name>
synchronous_transfer = all
2) Asynchronous fail-back safe standby
synchronous_standby_names = <server name>
synchronous_transfer = data_flush
3) Pure synchronous standby
synchronous_standby_names = <server name>
synchronous_transfer = commit
4) Pure asynchronous standby
synchronous_transfer = commit
**Restriction:
If multiple standby servers connect to the master, then the standby using synchronous replication becomes the failback safe standby.
For example: if there are 2 standby servers connected to the master server (one SYNC, the other ASYNC) and synchronous_transfer is set to 'all',
then the SYNC standby becomes the failback safe standby and the master server will wait only for the SYNC standby server.
**Performance overhead of synchronous_transfer patch:
Tests were performed with the pgbench benchmark using the following configuration options:
Transaction type: TPC-B (sort of)
Scaling factor: 300
Query mode: simple
Number of clients: 150
Number of threads: 1
Duration: 1800 s
Real-world scenarios are mostly based on fast transaction workloads, for which synchronous_transfer has negligible overhead.
** 1. Test for fast transaction workloads [measured w.r.t default replication in PostgreSQL, pgbench benchmark - TPS value]:
a. On average, performance overhead caused by a synchronous standby: 0.0102%.
b. On average, performance overhead caused by a synchronous failback safe standby: 0.2943%.
c. On average, performance overhead caused by an asynchronous standby: 0.04321%.
d. On average, performance overhead caused by an asynchronous failback safe standby: 0.5141%.
**2. Test for large loads and index builds [measured w.r.t default replication in PostgreSQL, pgbench benchmark (-i option) - time in seconds]:
a. On average, performance overhead caused by a synchronous standby: 3.51%.
b. On average, performance overhead caused by a synchronous failback safe standby: 14.88%.
c. On average, performance overhead caused by an asynchronous standby: 0.4887%.
d. On average, performance overhead caused by an asynchronous failback safe standby: 10.19%.
**TO-DO:
More discussion is needed regarding the usefulness, need, and priority of the following; any feedback is appreciated:
1. Support for multiple failback safe standbys.
2. The current design of the patch waits forever for the failback safe standby, just as synchronous streaming replication does.
3. Support for a cascaded failback safe standby.
---
Attachment
On 9/12/13 3:00 AM, Samrat Revagade wrote: > > We are improving the patch for Commit Fest 2 now. > We will fix above compiler warnings as soon as possible and submit > the patch > > > Attached *synchronous_transfer_v5.patch* implements review comments from > commit fest-1 and reduces the performance overhead of synchronous_transfer. There is still this compiler warning: syncrep.c: In function ‘SyncRepReleaseWaiters’: syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable]
On Fri, Sep 13, 2013 at 1:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 9/12/13 3:00 AM, Samrat Revagade wrote: >> >> We are improving the patch for Commit Fest 2 now. >> We will fix above compiler warnings as soon as possible and submit >> the patch >> >> >> Attached *synchronous_transfer_v5.patch* implements review comments from >> commit fest-1 and reduces the performance overhead of synchronous_transfer. > > There is still this compiler warning: > > syncrep.c: In function ‘SyncRepReleaseWaiters’: > syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used > [-Wunused-but-set-variable] > Sorry I forgot fix it. I have attached the patch which I modified. Regards, ------- Sawada Masahiko
Attachment
Sorry, I forgot to fix it.
> syncrep.c: In function ‘SyncRepReleaseWaiters’:
> syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
> [-Wunused-but-set-variable]
>
I have attached the patch which I modified.
Attachment
On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade <revagade.samrat@gmail.com> wrote: > >> > syncrep.c: In function ‘SyncRepReleaseWaiters’: >> > syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used >> > [-Wunused-but-set-variable] >> > >> >> Sorry I forgot fix it. >> >> I have attached the patch which I modified. >> > > Attached patch combines documentation patch and source-code patch. I set up synchronous replication with synchronous_transfer = all, and then I ran pgbench -i and executed CHECKPOINT in the master. After that, when I executed CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by synchronous_transfer feature. How does synchronous_transfer work with cascade replication? If it's set to all in the "sender-side" standby, it can resolve the data page inconsistency between two standbys? Regards, -- Fujii Masao
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade > <revagade.samrat@gmail.com> wrote: >> >>> > syncrep.c: In function ‘SyncRepReleaseWaiters’: >>> > syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used >>> > [-Wunused-but-set-variable] >>> > >>> >>> Sorry I forgot fix it. >>> >>> I have attached the patch which I modified. >>> >> >> Attached patch combines documentation patch and source-code patch. > > I set up synchronous replication with synchronous_transfer = all, and then I ran > pgbench -i and executed CHECKPOINT in the master. After that, when I executed > CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by > synchronous_transfer feature. Did you set synchronous_standby_names in the standby server? If so, the master server waits for the standby server which is set to synchronous_standby_names. Please let me know detail of this case. > > How does synchronous_transfer work with cascade replication? If it's set to all > in the "sender-side" standby, it can resolve the data page inconsistency between > two standbys? > Currently patch supports the case which two servers are set up SYNC replication. IWO, failback safe standby is the same as SYNC replication standby. User can set synchronous_transfer in only master side. Regards, ------- Sawada Masahiko
On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> I set up synchronous replication with synchronous_transfer = all, and then I ran >> pgbench -i and executed CHECKPOINT in the master. After that, when I executed >> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by >> synchronous_transfer feature. > > Did you set synchronous_standby_names in the standby server? Yes. > If so, the master server waits for the standby server which is set to > synchronous_standby_names. > Please let me know detail of this case. Both master and standby have the same postgresql.conf settings as follows: max_wal_senders = 4 wal_level = hot_standby wal_keep_segments = 32 synchronous_standby_names = '*' synchronous_transfer= all >> How does synchronous_transfer work with cascade replication? If it's set to all >> in the "sender-side" standby, it can resolve the data page inconsistency between >> two standbys? >> > > Currently patch supports the case which two servers are set up SYNC replication. > IWO, failback safe standby is the same as SYNC replication standby. > User can set synchronous_transfer in only master side. So, it's very strange that CHECKPOINT on the standby gets stuck infinitely. Regards, -- Fujii Masao
On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> I set up synchronous replication with synchronous_transfer = all, and then I ran >>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed >>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by >>> synchronous_transfer feature. >> >> Did you set synchronous_standby_names in the standby server? > > Yes. > >> If so, the master server waits for the standby server which is set to >> synchronous_standby_names. >> Please let me know detail of this case. > > Both master and standby have the same postgresql.conf settings as follows: > > max_wal_senders = 4 > wal_level = hot_standby > wal_keep_segments = 32 > synchronous_standby_names = '*' > synchronous_transfer = all > >>> How does synchronous_transfer work with cascade replication? If it's set to all >>> in the "sender-side" standby, it can resolve the data page inconsistency between >>> two standbys? >>> >> >> Currently patch supports the case which two servers are set up SYNC replication. >> IWO, failback safe standby is the same as SYNC replication standby. >> User can set synchronous_transfer in only master side. > > So, it's very strange that CHECKPOINT on the standby gets stuck infinitely. > yes I think so. I was not considering that user set synchronous_standby_names in the standby server. it will ocurr I will fix it considering this case. Regards, ------- Sawada Masahiko
On Wed, Sep 18, 2013 at 1:05 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> I set up synchronous replication with synchronous_transfer = all, and then I ran >>>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed >>>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by >>>> synchronous_transfer feature. >>> >>> Did you set synchronous_standby_names in the standby server? >> >> Yes. >> >>> If so, the master server waits for the standby server which is set to >>> synchronous_standby_names. >>> Please let me know detail of this case. >> >> Both master and standby have the same postgresql.conf settings as follows: >> >> max_wal_senders = 4 >> wal_level = hot_standby >> wal_keep_segments = 32 >> synchronous_standby_names = '*' >> synchronous_transfer = all >> >>>> How does synchronous_transfer work with cascade replication? If it's set to all >>>> in the "sender-side" standby, it can resolve the data page inconsistency between >>>> two standbys? >>>> >>> >>> Currently patch supports the case which two servers are set up SYNC replication. >>> IWO, failback safe standby is the same as SYNC replication standby. >>> User can set synchronous_transfer in only master side. >> >> So, it's very strange that CHECKPOINT on the standby gets stuck infinitely. >> Sorry I sent mail by mistake. yes I think so. It waits for corresponding WAL replicated. Behaviour of synchronous_transfer is similar to synchronous_standby_names and synchronous replication little. That is, if those parameter is set but the standby server doesn't connect to the master server, the master server waits for corresponding WAL replicated to standby server infinitely. I was not considering that user set synchronous_standby_names in the standby server. I will fix it considering this case. Regards, ------- Sawada Masahiko
On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> I set up synchronous replication with synchronous_transfer = all, and then I ran >>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed >>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by >>> synchronous_transfer feature. >> >> Did you set synchronous_standby_names in the standby server? > > Yes. > >> If so, the master server waits for the standby server which is set to >> synchronous_standby_names. >> Please let me know detail of this case. > > Both master and standby have the same postgresql.conf settings as follows: > > max_wal_senders = 4 > wal_level = hot_standby > wal_keep_segments = 32 > synchronous_standby_names = '*' > synchronous_transfer = all > >>> How does synchronous_transfer work with cascade replication? If it's set to all >>> in the "sender-side" standby, it can resolve the data page inconsistency between >>> two standbys? >>> >> >> Currently patch supports the case which two servers are set up SYNC replication. >> IWO, failback safe standby is the same as SYNC replication standby. >> User can set synchronous_transfer in only master side. > > So, it's very strange that CHECKPOINT on the standby gets stuck infinitely. > I attached the patch which I have modified. I have modified that if both synchronous replication and synchronous transfer are requested, but the server still in recovery(i.g., the server is in standby mode), the server doesn't wait for corresponding WAL replicated. Specifically, I added condition RecoveryInProgress(). If both functions(synchronous replication and transfer) are set and user sets up synchronous replication between two servers, user can executes CHECKPOINT on standby side. It will not wait for corresponding WAL replicated. But, If both parameter are set and user doesn't set up synchronous replication(i.g., the master server works alone), the master server waits infinitely when user executes CHECKPOINT. This behaviour is similar to synchronous replication. Regards, ------- Sawada Masahiko
Attachment
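To make the fix described above concrete, here is a minimal sketch of a recovery-aware guard. RecoveryInProgress() is the existing PostgreSQL function and SyncTransRequested() is the macro referenced elsewhere in this thread; the wrapper function, its name and WaitForWALReplication() are illustrative only, not the patch's actual code:

#include "postgres.h"
#include "access/xlog.h"        /* RecoveryInProgress(), XLogRecPtr */

/*
 * Sketch only: skip the synchronous-transfer wait while this server is
 * itself in recovery (i.e. it is a standby), so that e.g. a CHECKPOINT
 * issued on the standby does not block waiting for its own standby.
 */
static void
MaybeWaitForSyncTransfer(XLogRecPtr lsn)
{
    if (!SyncTransRequested())
        return;                 /* synchronous_transfer = commit */

    if (RecoveryInProgress())
        return;                 /* standby side: never wait here */

    /* master side: wait until WAL up to 'lsn' is replicated */
    WaitForWALReplication(lsn);
}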
On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > I attached the patch which I have modified. Thanks for updating the patch! Here are the review comments: I got the compiler warning: syncrep.c:112: warning: unused variable 'i' How does synchronous_transfer work with synchronous_commit? + * accept all the likely variants of "off". This comment should be removed because synchronous_transfer doesn't accept the value "off". + {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true}, ISTM the third value "true" should be "false". + {"0", SYNCHRONOUS_TRANSFER_COMMIT, true}, Why is this needed? + elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine", + (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto, + (uint32) (sentPtr >> 32), (uint32) sentPtr); Why is this needed? +#define SYNC_REP_WAIT_FLUSH 1 +#define SYNC_REP_WAIT_DATA_FLUSH 2 Why do we need to separate the wait-queue for wait-data-flush from that for wait-flush? ISTM that wait-data-flush also can wait for the replication on the wait-queue for wait-flush, and which would simplify the patch. Regards, -- Fujii Masao
On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> I attached the patch which I have modified. > > Thanks for updating the patch! > > Here are the review comments: > Thank you for reviewing! > I got the compiler warning: > > syncrep.c:112: warning: unused variable 'i' > > How does synchronous_transfer work with synchronous_commit? The currently patch synchronous_transfer doesn't work when synchronous_commit is set 'off' or 'local'. if user changes synchronous_commit value on transaction, checkpointer process can't see it. Due to that, even if synchronous_commit is changed to 'off' from 'on', synchronous_transfer doesn't work. I'm planning to modify the patch so that synchronous_transfer is not affected by synchronous_commit. > > + * accept all the likely variants of "off". > > This comment should be removed because synchronous_transfer > doesn't accept the value "off". > > + {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true}, > > ISTM the third value "true" should be "false". > > + {"0", SYNCHRONOUS_TRANSFER_COMMIT, true}, > > Why is this needed? > > + elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= > sentPtr(%X/%X) AND sendTImeLine", > + (uint32) (sendTimeLineValidUpto >> 32), (uint32) > sendTimeLineValidUpto, > + (uint32) (sentPtr >> 32), (uint32) sentPtr); > > Why is this needed? > They are unnecessary. I had forgot to remove unnecessary codes. > +#define SYNC_REP_WAIT_FLUSH 1 > +#define SYNC_REP_WAIT_DATA_FLUSH 2 > > Why do we need to separate the wait-queue for wait-data-flush > from that for wait-flush? ISTM that wait-data-flush also can > wait for the replication on the wait-queue for wait-flush, and > which would simplify the patch. > Yes, it seems not necessary to add queue newly. I will delete SYNC_REP_WAIT_DATA_FLUSH and related that. Regards, ------- Sawada Masahiko
On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>> I attached the patch which I have modified. >> >> Thanks for updating the patch! >> >> Here are the review comments: >> > > Thank you for reviewing! > >> I got the compiler warning: >> >> syncrep.c:112: warning: unused variable 'i' >> >> How does synchronous_transfer work with synchronous_commit? > > The currently patch synchronous_transfer doesn't work when > synchronous_commit is set 'off' or 'local'. > if user changes synchronous_commit value on transaction, checkpointer > process can't see it. > Due to that, even if synchronous_commit is changed to 'off' from 'on', > synchronous_transfer doesn't work. > I'm planning to modify the patch so that synchronous_transfer is not > affected by synchronous_commit. Hmm... when synchronous_transfer is set to data_flush, IMO the intuitive behaviors are (1) synchronous_commit = on A data flush should wait for the corresponding WAL to be flushed in the standby (2) synchronous_commit = remote_write A data flush should wait for the corresponding WAL to be written to OS in the standby. (3) synchronous_commit = local (4) synchronous_commit = off A data flush should wait for the corresponding WAL to be written locally in the master. Regards, -- Fujii Masao
On Thu, Sep 19, 2013 at 7:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >>>> I attached the patch which I have modified. >>> >>> Thanks for updating the patch! >>> >>> Here are the review comments: >>> >> >> Thank you for reviewing! >> >>> I got the compiler warning: >>> >>> syncrep.c:112: warning: unused variable 'i' >>> >>> How does synchronous_transfer work with synchronous_commit? >> >> The currently patch synchronous_transfer doesn't work when >> synchronous_commit is set 'off' or 'local'. >> if user changes synchronous_commit value on transaction, checkpointer >> process can't see it. >> Due to that, even if synchronous_commit is changed to 'off' from 'on', >> synchronous_transfer doesn't work. >> I'm planning to modify the patch so that synchronous_transfer is not >> affected by synchronous_commit. > > Hmm... when synchronous_transfer is set to data_flush, > IMO the intuitive behaviors are > > (1) synchronous_commit = on > A data flush should wait for the corresponding WAL to be > flushed in the standby > > (2) synchronous_commit = remote_write > A data flush should wait for the corresponding WAL to be > written to OS in the standby. > > (3) synchronous_commit = local > (4) synchronous_commit = off > A data flush should wait for the corresponding WAL to be > written locally in the master. > It is good idea. So synchronous_commit value need to be visible from other process. To share synchronous_commit with other process, I will try to put synchronous_commit value into shared buffer. Is there already the guc parameter which is shared with other process? I tried to find such parameter, but there was not it. Regards, ------- Sawada Masahiko
>Attached patch combines documentation patch and source-code patch.
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,50 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between the master server and the
+ standby server. It specifies whether the master server will wait, before
+ making a file system level change (for example, modifying a data page),
+ until the corresponding WAL records are replicated to the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master will wait only for transaction commits; this is equivalent
+ to turning the <literal>synchronous_transfer</> parameter off, and the standby
+ server will behave as a <quote>synchronous standby</> in
+ streaming replication. For the value <literal>data_flush</>, the master will
+ wait only for data page modifications but not for transaction
+ commits, hence the standby server will act as an <quote>asynchronous
+ failback safe standby</>. For the value <literal>all</>, the master will wait
+ for data page modifications as well as for transaction commits, and the
+ resulting standby server will act as a <quote>synchronous failback safe
+ standby</>. The wait happens during background activities and hence does not create noticeable performance overhead.
+ To configure a synchronous failback safe standby,
+ <xref linkend="guc-synchronous-standby-names"> should be set.
+ </para>
+ </listitem>
+ </varlistentry>
@@ -2258,14 +2302,25 @@ include 'filename'</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set then the standby will behave as a synchronous standby in replication,
+ as described in <xref linkend="synchronous-replication">, or as a synchronous
+ failback safe standby, as described in <xref linkend="failback-safe">.
+ At any time there will be at most one active standby; when the standby is
+ a synchronous standby in replication, transactions waiting for commit
+ will be allowed to proceed after this standby server confirms receipt
+ of their data. But when the standby is a synchronous failback safe standby,
+ data page modifications as well as transaction commits will be allowed
+ to proceed only after this standby server confirms receipt of their data.
+ If this parameter is set to an empty value and
+ <xref linkend="guc-synchronous-transfer"> is set to <literal>data_flush</>,
+ then the standby is called an asynchronous failback safe standby and only
+ data page modifications will wait until the corresponding WAL record is
+ replicated to the standby.
+ </para>
+ <para>
+ Synchronous standby in replication will be the first standby named in
+ this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL streaming replication offers durability, but if the master crashes and
+a particular WAL record fails to reach the standby server, then that
+WAL record is present on the master server but not on the standby server.
+In such a case the master is ahead of the standby server in terms of WAL records and data in the database.
+This leads to file-system level inconsistency between the master and the standby server.
+For example, a heap page update on the master might not have been reflected on the standby when the master crashes.
+ </para>
+
+ <para>
+Due to this inconsistency, a fresh backup of the new master onto the new standby is needed to re-prepare the HA cluster.
+Taking a fresh backup can be a very time consuming process when the database is large. In such a case, disaster recovery
+can take a very long time if streaming replication is used to set up the high availability cluster.
+ </para>
+
+ <para>
+If the HA cluster is configured with a failback safe standby then this fresh backup can be avoided.
+The <xref linkend="guc-synchronous-transfer"> parameter controls all WAL transfers so that
+the master does not make any file system level change until it gets a confirmation from the standby server.
+This avoids the need for a fresh backup by maintaining consistency.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending on whether the master waits for transaction commits
+ or not. By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step to configure HA with failback safe standby is to setup
+ streaming replication. Configuring synchronous failback safe standby
+ requires setting up <xref linkend="guc-synchronous-transfer"> to
+ <literal>all</> and <xref linkend="guc-synchronous-standby-names">
+ must be set to a non-empty value. This configuration will cause each
+ commit and data page modification to wait for confirmation that standby
+ has written corresponding WAL record to durable storage. Configuring
+ asynchronous failback safe standby requires only setting up
+ <xref linkend="guc-synchronous-transfer"> to <literal> data_flush</>.
+ This configuration will cause only data page modifications to wait
+ for confirmation that standby has written corresponding WAL record
+ to durable storage.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby server and hence
+ a fresh backup from the new master onto the old master is needed to configure
+ the old primary server as a new standby server. Without taking a fresh
+ backup, even if the new standby starts, streaming replication does not
+ start successfully. The activity of taking a backup can be fast for smaller
+ databases, but for a large database it requires more time to re-prepare the
+ failover cluster in a streaming replication configuration of the HA cluster.
+ This could break the service level agreement for crash
+ recovery. The need for a fresh backup and the problem of long
+ recovery time can be solved if the HA cluster is configured with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby allows synchronous WAL transfer at the required places
+ while maintaining file-system level consistency between the master and standby
+ servers, without a backup having to be taken onto the old master.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..da3820f 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to commit; there is no
+ need to guard against database inconsistency between master and standby during failover.
+ </para>
>Attached patch combines documentation patch and source-code patch.
I have had a stab at reviewing the documentation. Have a look.
Attachment
On Fri, Sep 20, 2013 at 10:33 PM, Samrat Revagade <revagade.samrat@gmail.com> wrote: > > > > On Fri, Sep 20, 2013 at 3:40 PM, Sameer Thakur <samthakur74@gmail.com> > wrote: >> >> >> >> >>>> >>> >>> >Attached patch combines documentation patch and source-code patch. >> >> >> I have had a stab at reviewing the documentation. Have a look. >> > > Thanks. > Attached patch implements suggestions in documentation. > But comments from Fujii-san still needs to be implemented . > We will implement them soon. > I have attached the patch which modify based on Fujii-san suggested. If synchronous_transfer is set 'data_flush', behaviour of synchronous_transfer with synchronous_commit is (1) synchronous_commit = on A data flush should wait for the corresponding WAL to be flushed in the standby (2) synchronous_commit = remote_write A data flush should wait for the corresponding WAL to be written to OS in the standby. (3) synchronous_commit = local (4) synchronous_commit = off A data flush should wait for the corresponding WAL to be written locally in the master. Even if user changes synchronous_commit value in transaction, other process (e.g. checkpointer process) can't confirm it. Currently patch, each processes uses locally synchronous_commit. Regards, ------- Sawada Masahiko
Attachment
Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are
(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby
(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to OS in the standby.
(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> >> >> Hmm... when synchronous_transfer is set to data_flush, >> IMO the intuitive behaviors are >> >> (1) synchronous_commit = on >> A data flush should wait for the corresponding WAL to be >> flushed in the standby >> >> (2) synchronous_commit = remote_write >> A data flush should wait for the corresponding WAL to be >> written to OS in the standby. >> >> (3) synchronous_commit = local >> (4) synchronous_commit = off >> A data flush should wait for the corresponding WAL to be >> written locally in the master. > > > I thought synchronous_commit and synchronous_transfer are kind of orthogonal > to each other. synchronous_commit only controls whether and how to wait for > the standby only when a transaction commits. synchronous_transfer OTOH tells > how to interpret the standby listed in synchronous_standbys parameter. If > set to "commit" then they are synchronous standbys (like today). If set to > "data_flush", they are asynchronous failback safe standby and if set to > "all" then they are synchronous failback safe standbys. Well, its confusing > :-( > > So IMHO in the current state of things, the synchronous_transfer GUC can not > be changed at a session/transaction level since all backends, including > background workers must honor the settings to guarantee failback safety. > synchronous_commit still works the same way, but is ignored if > synchronous_transfer is set to "data_flush" because that effectively tells > us that the standbys listed under synchronous_standbys are really *async* > standbys with failback safety. > Thank you for comment. I think it is good simple idea. In your opinion, if synchronous_transfer is set 'all' and synchronous_commit is set 'on', the master wait for data flush eve if user sets synchronous_commit to 'local' or 'off'. For example, when user want to do transaction early, user can't do this. we leave the such situation as constraint? Regards, ------- Sawada Masahiko
Thank you for comment. I think it is good simple idea.
>
In your opinion, if synchronous_transfer is set to 'all' and
synchronous_commit is set to 'on',
the master waits for data flush even if the user sets synchronous_commit to
'local' or 'off'.
For example, when the user wants a transaction to complete quickly, the user can't do this.
Do we leave such a situation as a constraint?
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> > wrote: >> >> >> > >> >> Thank you for comment. I think it is good simple idea. >> In your opinion, if synchronous_transfer is set 'all' and >> synchronous_commit is set 'on', >> the master wait for data flush eve if user sets synchronous_commit to >> 'local' or 'off'. >> For example, when user want to do transaction early, user can't do this. >> we leave the such situation as constraint? >> > > No, user can still override the transaction commit point wait. So if > > synchronous_transfer is set to "all": > - If synchronous_commit is ON - wait at all points > - If synchronous_commit is OFF - wait only at buffer flush (and other > related to failback safety) points > > synchronous_transfer is set to "data_flush": > - If synchronous_commit is either ON o OFF - do not wait at commit points, > but wait at all other points > > synchronous_transfer is set to "commit": > - If synchronous_commit is ON - wait at commit point > - If synchronous_commit is OFF - do not wait at any point > Thank you for explain. Understood. if synchronous_transfer is set 'all' and user changes synchronous_commit to 'off'( or 'local') at a transaction, the master server wait at buffer flush, but doesn't wait at commit points. Right? In currently patch, synchronous_transfer works in cooperation with synchronous_commit. But if user changes synchronous_commit at a transaction, they are not in cooperation. So, your idea might be better than currently behaviour of synchronous_transfer. Regards, ------- Sawada Masahiko
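Pavan's three combinations above can be summarized as two small predicates. This is only a sketch: the SYNCHRONOUS_TRANSFER_* values are assumed from the SynchronousTransferLevel enum quoted later in this thread, and the two helper functions and their names are illustrative, not code from the patch:

#include "postgres.h"

/* Values assumed from the SynchronousTransferLevel enum quoted below */
typedef enum
{
    SYNCHRONOUS_TRANSFER_COMMIT,
    SYNCHRONOUS_TRANSFER_DATA_FLUSH,
    SYNCHRONOUS_TRANSFER_ALL
} SynchronousTransferLevel;

/* Should a transaction commit wait for the standby? */
static bool
WaitAtCommit(SynchronousTransferLevel transfer, bool sync_commit_on)
{
    if (transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
        return false;           /* never wait at commit points */
    return sync_commit_on;      /* 'all' and 'commit' follow synchronous_commit */
}

/* Should a data-page flush (checkpointer, bgwriter, vacuum, ...) wait? */
static bool
WaitAtDataFlush(SynchronousTransferLevel transfer)
{
    return transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH ||
           transfer == SYNCHRONOUS_TRANSFER_ALL;
}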
On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee > <pavan.deolasee@gmail.com> wrote: >> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> >> wrote: >>> >>> >>> > >>> >>> Thank you for comment. I think it is good simple idea. >>> In your opinion, if synchronous_transfer is set 'all' and >>> synchronous_commit is set 'on', >>> the master wait for data flush eve if user sets synchronous_commit to >>> 'local' or 'off'. >>> For example, when user want to do transaction early, user can't do this. >>> we leave the such situation as constraint? >>> >> >> No, user can still override the transaction commit point wait. So if >> >> synchronous_transfer is set to "all": >> - If synchronous_commit is ON - wait at all points >> - If synchronous_commit is OFF - wait only at buffer flush (and other >> related to failback safety) points >> >> synchronous_transfer is set to "data_flush": >> - If synchronous_commit is either ON o OFF - do not wait at commit points, >> but wait at all other points >> >> synchronous_transfer is set to "commit": >> - If synchronous_commit is ON - wait at commit point >> - If synchronous_commit is OFF - do not wait at any point >> > > Thank you for explain. Understood. > if synchronous_transfer is set 'all' and user changes > synchronous_commit to 'off'( or 'local') at a transaction, > the master server wait at buffer flush, but doesn't wait at commit > points. Right? > > In currently patch, synchronous_transfer works in cooperation with > synchronous_commit. > But if user changes synchronous_commit at a transaction, they are not > in cooperation. > So, your idea might be better than currently behaviour of synchronous_transfer. I attached the v11 patch which have fixed following contents. - synchronous_transfer controls to wait at only data flush level, synchronous_commit controls to wait at commit level. ( Based on Pavan suggestion) - If there are no sync replication standby name, both synchronous_commit and synchronous_transfer don't work. - Fixed that we didn't support failback-safe standby. Previous patch can not support failback-safe standby. Because the patch doesn't wait at FlushBuffer which is called by autovacuum. So, if user want to do transaction early temporarily, user need to change the synchronous_transfer value and reload postgresql.conf. Regards, ------- Sawada Masahiko
Attachment
On Fri, Oct 4, 2013 at 1:46 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: >> On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee >> <pavan.deolasee@gmail.com> wrote: >>> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> >>> wrote: >>>> >>>> >>>> > >>>> >>>> Thank you for comment. I think it is good simple idea. >>>> In your opinion, if synchronous_transfer is set 'all' and >>>> synchronous_commit is set 'on', >>>> the master wait for data flush eve if user sets synchronous_commit to >>>> 'local' or 'off'. >>>> For example, when user want to do transaction early, user can't do this. >>>> we leave the such situation as constraint? >>>> >>> >>> No, user can still override the transaction commit point wait. So if >>> >>> synchronous_transfer is set to "all": >>> - If synchronous_commit is ON - wait at all points >>> - If synchronous_commit is OFF - wait only at buffer flush (and other >>> related to failback safety) points >>> >>> synchronous_transfer is set to "data_flush": >>> - If synchronous_commit is either ON o OFF - do not wait at commit points, >>> but wait at all other points >>> >>> synchronous_transfer is set to "commit": >>> - If synchronous_commit is ON - wait at commit point >>> - If synchronous_commit is OFF - do not wait at any point >>> >> >> Thank you for explain. Understood. >> if synchronous_transfer is set 'all' and user changes >> synchronous_commit to 'off'( or 'local') at a transaction, >> the master server wait at buffer flush, but doesn't wait at commit >> points. Right? >> >> In currently patch, synchronous_transfer works in cooperation with >> synchronous_commit. >> But if user changes synchronous_commit at a transaction, they are not >> in cooperation. >> So, your idea might be better than currently behaviour of synchronous_transfer. > > I attached the v11 patch which have fixed following contents. You added several checks into SyncRepWaitForLSN() so that it can handle both synchronous_transfer=data_flush and =commit. This change made the source code of the function very complicated, I'm afraid. To simplify the source code, what about just adding new wait-for-lsn function for data_flush instead of changing SyncRepWaitForLSN()? Obviously that new function and SyncRepWaitForLSN() have the common part. I think that it should be extracted as separate function. + * Note that if sync transfer is requested, we can't regular maintenance until + * standbys to connect. */ - if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH) + if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH && !SyncTransRequested()) Per discussion with Pavan, ISTM we don't need to avoid setting synchronous_commit to local even if synchronous_transfer is data_flush. But you did that here. Why? When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked. Is this safe? +#synchronous_transfer = commit # data page synchronization level + # commit, data_flush or all This comment seems confusing. I think that this parameter specifies when to wait for replication. +typedef enum +{ + SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */ + SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only + * no wait for WAL */ + SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL*/ +} SynchronousTransferLevel; These comments also seem confusing. 
For example, I think that the meaning of SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on transaction commit". @@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record) */ XLogFlush(lsn); + /* + * If synchronous transfer is requested, wait for failback safe standby + * to receive WAL up to lsn. + */ + if (SyncTransRequested()) + SyncRepWaitForLSN(lsn, true, true); If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't need to be called here. Regards, -- Fujii Masao
On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > You added several checks into SyncRepWaitForLSN() so that it can handle both > synchronous_transfer=data_flush and =commit. This change made the source code > of the function very complicated, I'm afraid. To simplify the source code, > what about just adding new wait-for-lsn function for data_flush instead of > changing SyncRepWaitForLSN()? Obviously that new function and > SyncRepWaitForLSN() > have the common part. I think that it should be extracted as separate function. Thank you for reviewing and comment! yes I agree with you. I attached the v12 patch which have modified based on above suggestions. - Added new function SyncRepTransferWaitForLSN() and SyncRepWait() SyncRepTransferWaitForLSN() is called on date page flush. OTOH, SyncRepWatiForLSN() is called on transaction commit. And both functions call the SyncRepWait() after check whether sync commit/transfer is requested. Practically server will waits at SyncRepWait(). > > + * Note that if sync transfer is requested, we can't regular > maintenance until > + * standbys to connect. > */ > - if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH) > + if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH && > !SyncTransRequested()) > > Per discussion with Pavan, ISTM we don't need to avoid setting > synchronous_commit > to local even if synchronous_transfer is data_flush. But you did that here. Why? I made a mistake. I have removed it. > > When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked. > Is this safe? > In the new version patch, when synchronous_transfer = data_flush/all AND synchronous_standby_names is set, vacuum is blocked. This behaviour of synchronous_transfer similar to synchronous_commit. Should this allow to do anti-wraparound vacuum even if synchronous_transfer = data_flush/all? If so, it also allow to flush the data page while doing vacuum? > +#synchronous_transfer = commit # data page synchronization level > + # commit, data_flush or all > > This comment seems confusing. I think that this parameter specifies when to > wait for replication. > > +typedef enum > +{ > + SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */ > + SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only > + * no wait for WAL */ > + SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL*/ > +} SynchronousTransferLevel; > > These comments also seem confusing. For example, I think that the meaning of > SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on > transaction commit". Those comments are modified in new patch. > > @@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record) > */ > XLogFlush(lsn); > > + /* > + * If synchronous transfer is requested, wait for failback safe standby > + * to receive WAL up to lsn. > + */ > + if (SyncTransRequested()) > + SyncRepWaitForLSN(lsn, true, true); > > If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't need > to be called here. Thank you for info. I have removed it at smgr_redo(). Regards, ------- Sawada Masahiko
Attachment
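From the v12 description above, the refactoring appears to boil down to two thin entry points over one shared wait routine. A rough sketch of that shape follows: SyncRepWaitForLSN() and SyncRepRequested() exist in PostgreSQL today, SyncRepTransferWaitForLSN(), SyncRepWait() and SyncTransRequested() are the names used in this thread, and the bodies and signatures here are simplified illustrations, not the patch's actual code:

#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr */

/* Shared wait routine: block until WAL up to 'lsn' is replicated. */
static void
SyncRepWait(XLogRecPtr lsn)
{
    /*
     * Join the sync-rep wait queue keyed by 'lsn' and sleep until a
     * walsender confirms the standby has passed it (details elided).
     */
}

/* Called at transaction commit, as today, honouring synchronous_commit. */
/* SyncRepRequested() is the macro from replication/syncrep.h;           */
/* SyncTransRequested() is the macro added by the patch.                 */
void
SyncRepWaitForLSN(XLogRecPtr lsn)
{
    if (SyncRepRequested())
        SyncRepWait(lsn);
}

/* Called before a data-page flush when synchronous_transfer requires it. */
void
SyncRepTransferWaitForLSN(XLogRecPtr lsn)
{
    if (SyncTransRequested())
        SyncRepWait(lsn);
}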
I attached the v12 patch, which I have modified based on the above suggestions.
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote: > On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote: > > > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > > > I attached the v12 patch which have modified based on above suggestions. > > > > There are still some parts of this design/patch which I am concerned about. > > 1. The design clubs synchronous standby and failback safe standby rather > very tightly. IIRC this is based on the feedback you received early, so my > apologies for raising it again so late. It is my impression that there still are several people having pretty fundamental doubts about this approach in general. From what I remember neither Heikki, Simon, Tom nor me were really convinced about this approach. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 08.10.2013 13:00, Pavan Deolasee wrote: > On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund<andres@2ndquadrant.com>wrote: > >> It is my impression that there still are several people having pretty >> fundamental doubts about this approach in general. From what I remember >> neither Heikki, Simon, Tom nor me were really convinced about this >> approach. >> > IIRC you and Tom were particularly skeptical about the approach. But do you > see a technical flaw or a show stopper with the approach ? Heikki has > written pg_rewind which is really very cool. But it fails to handle the > hint bit updates which are not WAL logged unless of course checksums are > turned on. We can have a GUC controlled option to turn WAL logging on for > hint bit updates and then use pg_rewind for the purpose. But I did not see > any agreement on that either. Performance implication of WAL logging every > hint bit update could be huge. Yeah, I definitely think we should work on the pg_rewind approach instead of this patch. It's a lot more flexible. The performance hit of WAL-logging hint bit updates is the price you have to pay, but a lot of people were OK with that to get page checksum, so I think a lot of people would be OK with it for this purpose too. As long as it's optional, of course. And anyone using page checksums are already paying that price. - Heikki
On Tue, Oct 8, 2013 at 6:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I attached the v12 patch which have modified based on above suggestions.
>
> There are still some parts of this design/patch which I am concerned about.
>
> 1. The design clubs synchronous standby and failback safe standby rather
> very tightly. IIRC this is based on the feedback you received early, so my
> apologies for raising it again so late.
> a. GUC synchrnous_standby_names is used to name synchronous as well as
> failback safe standbys. I don't know if that will confuse users.

With the current patch, the user can specify the same server as both the failback safe standby and the sync replication standby in synchronous_standby_names, so I was thinking that it will not confuse users.

> b. synchronous_commit's value will also control whether a sync/async
> failback safe standby wait for remote write or flush. Is that reasonable ?
> Or should there be a different way to configure the failback safe standby's
> WAL safety ?

synchronous_commit's value does not control the wait level of a sync/async failback safe standby: on data page flush, the failback safe standby waits only for flush. Should we also allow it to wait for remote write?

> 2. With the current design/implementation, user can't configure a
> synchronous and an async failback safe standby at the same time. I think we
> discussed this earlier and there was an agreement on the limitation. Just
> wanted to get that confirmed again.

Yes, the user can't configure a sync standby and an async failback safe standby at the same time. The current patch supports the following cases:
- a sync standby that is also the failback safe standby
- an async standby that is also the failback safe standby

> 3. SyncRepReleaseWaiters() does not know whether its waking up backends
> waiting for sync rep or failback safe rep. Is that ok ? For example, I found
> that the elog() message announcing next takeover emitted by the function may
> look bad. Since changing synchronous_transfer requires server restart, we
> can teach SyncRepReleaseWaiters() to look at that parameter to figure out
> whether the standby is sync and/or failback safe standby.

I agree with you. Are you referring to the following code?

    if (announce_next_takeover)
    {
        announce_next_takeover = false;
        ereport(LOG,
                (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
                        application_name, MyWalSnd->sync_standby_priority)));
    }

> 4. The documentation still need more work to clearly explain the use case.

Understood. We will do more work to clearly explain the use case.

> 5. Have we done any sort of stress testing of the patch ? If there is a bug,
> the data corruption at the master can go unnoticed. So IMHO we need many
> crash recovery tests to ensure that the patch is functionally correct.

I have done several tests of the patch, and I have confirmed (using pg_filedump) that a data page is not flushed to disk while the master server has not yet received the reply from the standby server. To ensure that the patch is functionally correct, what tests should we do?

Regards,

-------
Sawada Masahiko
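To make the two supported combinations above concrete, here is an illustrative postgresql.conf sketch. The parameter names are taken from the patch discussion (synchronous_transfer with values commit, data_flush and all); the exact semantics may differ from the v12 code:

    # (1) Synchronous standby that is also the failback safe standby
    synchronous_standby_names = 'standby1'
    synchronous_commit = on            # wait for the standby at commit
    synchronous_transfer = all         # also wait before flushing data pages

    # (2) Asynchronous standby that is the failback safe standby
    synchronous_standby_names = 'standby1'
    synchronous_commit = local         # do not wait for the standby at commit
    synchronous_transfer = data_flush  # but do wait before flushing data pages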
On Tue, Oct 8, 2013 at 6:46 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote: >> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote: >> >> > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > > >> > I attached the v12 patch which have modified based on above suggestions. >> > >> >> There are still some parts of this design/patch which I am concerned about. >> >> 1. The design clubs synchronous standby and failback safe standby rather >> very tightly. IIRC this is based on the feedback you received early, so my >> apologies for raising it again so late. > > It is my impression that there still are several people having pretty > fundamental doubts about this approach in general. From what I remember > neither Heikki, Simon, Tom nor me were really convinced about this > approach. > Thank you for the comment. We think that this approach can solve the real problem, and we have actually confirmed its effect: the master server flushes a data page to disk only after it has received the reply from the standby server. If you have a concern or doubt on the technical side, could you tell me what it is? Regards, ------- Sawada Masahiko
On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It is my impression that there still are several people having pretty
> fundamental doubts about this approach in general. From what I remember
> neither Heikki, Simon, Tom nor me were really convinced about this
> approach.

Listing down all objections and their solutions:

Major objections on the proposal:
*Tom Lane*
# Additional complexity to the code and the performance overhead it will cause - On average it causes 0.5 - 1% performance overhead for a fast-transaction workload, as the wait is mostly in the backend process. The latest refactored code looks less complex.
# Use of rsync with checksum - but many pages on the two servers may differ in their binary values because of hint bits
*Heikki:*
# Use pg_rewind to do the same:
It has the well-known problem of hint-bit updates.
If we use it we need to enable checksums or explicitly WAL-log hint bits, which leads to performance overhead.
*Amit Kapila*
# How to take care of the extra WAL on the old master during recovery?
We can solve this by deleting all WAL files on the old master before it starts as the new standby.
*Simon Riggs*
# Renaming patch - done
# remove extra set of parameters - done
# Performance drop - On average it causes 0.5 - 1% performance overhead for a fast-transaction workload, as the wait is mostly in the backend process.
# The way of configuring the standby - with the synchronous_transfer parameter we can configure four types of standby servers depending on the need.
*Fujii Masao*
# How the patch interacts with cascaded standbys - the patch works the same way as synchronous replication.
# CHECKPOINT on the standby got stuck infinitely - fixed.
# Complicated conditions in SyncRepWaitForLSN() – code has been refactored in v11
# Improve source code comments - done
*Pavan Deolasee*
# Interaction of synchronous_commit with synchronous_transfer - Now synchronous_commit controls whether and how to wait for the standby only when a transaction commits; synchronous_transfer, OTOH, tells how to interpret the standbys listed in the synchronous_standby_names parameter.
# Further Improvements in the documentation - we will do that
# More stress testing - we will do that
Any inputs on stress testing would help.
On Wed, Oct 9, 2013 at 4:54 AM, Samrat Revagade <revagade.samrat@gmail.com> wrote: > On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com> > wrote: >> On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote: >> > On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko >> > <sawada.mshk@gmail.com>wrote: >> > >> > > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> >> > > wrote: >> > > > >> > > I attached the v12 patch which have modified based on above >> > > suggestions. >> > > >> > >> > There are still some parts of this design/patch which I am concerned >> > about. >> > >> > 1. The design clubs synchronous standby and failback safe standby rather >> > very tightly. IIRC this is based on the feedback you received early, so >> > my >> > apologies for raising it again so late. >> >> It is my impression that there still are several people having pretty >> fundamental doubts about this approach in general. From what I remember >> neither Heikki, Simon, Tom nor me were really convinced about this >> approach. > > > > Listing down all objections and their solutions: > > Major Objection on the proposal: > * Tom Lane* > # additional complexity to the code it will cause performance overhead - On > an average it causes 0.5 - 1% performance overhead for fast transaction > workload, as the wait is mostly on backend process. The latest re-factored > code, looks less complex. > # Use of rsync with checksum - but many pages on the two servers may differ > in their binary values because of hint bits > > *Heikki :* > # Use pg_rewind to do the same: > It has well known problem of hint bit updates. > If we use this we need enable checksums or explicitly WAL log hint bits > which leads to performance overhead > > *Amit Kapila* > # How to take care of extra WAL on old master during recovery.? > we can solve this by deleting all WAL file when old master before it starts > as new standby. > > *Simon Riggs* > # Renaming patch - done > # remove extra set of parameters - done > # performance drop - On an average it causes 0.5 - 1% performance overhead > for fast transaction workload, as the wait is mostly on backend process. > # The way of configuring standby - with synchronous_transfer parameter we > can configure 4 types of standby servers depending on the need. > > *Fujii Masao* > # how patch interacts with cascaded standby - patch works same as > synchronous replication > # CHECKPOINT in the standby, it got stuck infinitely. - fixed this > # Complicated conditions in SyncRepWaitForLSN() – code has been refactored > in v11 > # Improve source code comments - done > > *Pavan Deolasee* > # Interaction of synchronous_commit with synchronous_transfer - Now > synchronous_commit only controls whether and how > > to wait for the standby only when a transaction commits. > synchronous_transfer OTOH tells how to interpret the standby listed in > synchronous_standbys parameter. > # Further Improvements in the documentation - we will do that > # More stress testing - we will do that > > Any inputs on stress testing would help. The point is that when there are at least four senior community members expressing serious objections to a concept, three of whom are committers, we shouldn't be considering committing it until at least some of those people have withdrawn their objections. Nearly all patch submitters are in favor of their own patches; that does not entitle them to have those patches committed, even if there is a committer who agrees with them. There needs to be a real consensus on the path forward.
If that policy ever changes, I have my own list of things that are on the cutting-room floor that I'll be happy to resurrect. Personally, I don't have a strong opinion on this patch because I have not followed it closely enough. But if Tom, Heikki, Simon, and Andres are all unconvinced that this is a good direction, then put me down for a -1 vote as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 10, 2013 at 1:41 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > > > Not that I can find any flaw in the OP's patch, but given the major > objections and my own nervousness about documenting this new "failback safe" > standby mode, I am also inclining to improve pg_rewind or whatever it takes > to get it working. Clearly at first we need to have an optional mechanism to > WAL log hint bit updates. There seems to be two ways to do that: > > a. Add a new GUC which can turned on/off and requires server restart to take > effect > b. Add another option for wal_level setting. > > (b) looks better, but I am not sure if we want to support this new level > with and without hot standby. If latter, we will need multiple new levels to > differentiate all those cases. I am OK with supporting it only with hot > standby which is probably what most people do with streaming replication > anyway. > > The other issue is to how to optimally WAL log hint bit updates: > > a. Should we have separate WAL records just for the purpose or should we > piggyback them on heap update/delete/prune etc WAL records ? Of course, > there will be occasions when a simple SELECT also updates hint bits, so most > likely we will need a separate WAL record anyhow. > b. Does it make sense to try to all hint bits in a page if we are WAL > logging it anyways ? I think we have discussed this idea even before just to > minimize the number of writes a heap page receives when hint bits of > different tuples are set at different times, each update triggering a fresh > write. I don't remember whats the consensus for that, but it might be > worthwhile to reconsider that option if we are WAL logging the hint bit > updates. I agree with you. If writing the FPW does not cause a large performance degradation, one idea is to write the FPW with the same timing as when checksums are enabled. E.g., if we support a new wal_level, the system writes an FPW when a simple SELECT updates hint bits, but the checksum function stays disabled. Thoughts? > > We will definitely need some amount of performance benchmarks even if this > is optional. But are there other things to worry about ? Any strong > objections to this idea or any other stow stopper for pg_rewind itself ? > -- Regards, ------- Sawada Masahiko
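A minimal sketch of the idea being discussed, assuming some new boolean setting (whether it ends up as a GUC or a new wal_level value is exactly what is being debated). DataChecksumsEnabled() and XLogSaveBufferForHint() already exist for the checksum case; the setting name and helper below are illustrative only:

    /* Hypothetical setting; could equally be expressed as a new wal_level value. */
    bool    wal_log_hints = false;

    /*
     * Would a WAL record (full-page write) be needed for a hint-bit-only change?
     * With checksums the answer is already "yes"; the proposal is to give the
     * same answer when the new setting is on, so that pg_rewind can see every
     * block the old master touched after the standby forked.
     */
    static bool
    HintBitWALLoggingNeeded(void)
    {
        return DataChecksumsEnabled() || wal_log_hints;
    }

MarkBufferDirtyHint() would then call XLogSaveBufferForHint() whenever HintBitWALLoggingNeeded() returns true, instead of only when checksums are enabled.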
On 24.10.2013 13:02, Pavan Deolasee wrote: > Another difference AFAICS is that checksum feature needs the block to be > backed up only after the first time a hint bit is updated after checkpoint. > But for something like pg_rewind to work, we will need to WAL log every > hint bit update on a page. So we would want to keep it as short as possible. To fix that, pg_rewind could always start the rewinding process from the last checkpoint before the point that the histories diverge, instead of the exact point of divergence. That would make the rewinding more expensive as it needs to read through a lot more WAL, but I think it would still be OK. - Heikki
On 24.10.2013 14:15, Pavan Deolasee wrote: > On Thu, Oct 24, 2013 at 4:22 PM, Heikki Linnakangas<hlinnakangas@vmware.com >> wrote: > >> To fix that, pg_rewind could always start the rewinding process from the >> last checkpoint before the point that the histories diverge, instead of the >> exact point of divergence. > > Is that something required even if someone plans to use pg_rewind for a > cluster with checksums enabled ? I mean since only first update after > checkpoint is WAL logged, pg_rewind will break if another update happens > after standby forks. Yes. It's broken as it is, even when checksums are enabled - good catch. I'll go change it to read all the WAL in the target starting from the last checkpoint before the point of divergence. > Or would the recovery logic apply first WAL without > looking at the page lsn ? (Sorry, may be I should read the code instead of > asking you) WAL recovery does apply all the full-page images without looking at the page LSN, but that doesn't help in this case. pg_rewind copies over the blocks from the source server (= promoted standby) that were changed in the target server (= old master), after the standby's history diverged from it. In other words, it reverts the blocks that were changed in the old master, by copying them over from the promoted standby. After that, WAL recovery is performed, using the WAL from the promoted standby, to apply all the changes from the promoted standby that were not present in the old master. But it never replays any WAL from the old master. It reads it through, to construct the list of blocks that were modified, but it doesn't apply them. > If we do what you are suggesting, it seems like a single line patch to me. > In XLogSaveBufferForHint(), we probably need to look at this additional GUC > to decide whether or not to backup the block. Yeah, it's trivial to add such a guc. Will just have to figure out what we want the user interface to be like; should it be a separate guc, or somehow cram it into wal_level? - Heikki
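To make Heikki's description concrete, a very rough outline of the rewind procedure follows, adjusted to start from the last checkpoint before the divergence point as proposed above. All helper names are invented for illustration and do not correspond to pg_rewind's actual code:

    /*
     * Illustrative outline only; helper names are hypothetical and error
     * handling is omitted. target = old master (to be rewound),
     * source = promoted standby.
     */
    static void
    rewind_sketch(XLogRecPtr divergence_lsn)
    {
        /* Start scanning from the last checkpoint *before* the divergence, so
         * "first change after checkpoint" records emitted before the fork are
         * not missed. */
        XLogRecPtr  startpoint = last_checkpoint_before(divergence_lsn);

        /* 1. Read the target's WAL from startpoint to its end of WAL, only to
         *    remember which blocks each record touches; nothing is replayed. */
        BlockList  *changed = collect_changed_blocks(startpoint);

        /* 2. Copy those blocks back from the source, reverting the target's
         *    post-divergence changes. */
        copy_blocks_from_source(changed);

        /* 3. Run normal WAL recovery using the source's WAL, so the target
         *    catches up with everything the promoted standby has done. */
        replay_source_wal_from(startpoint);
    }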
Pavan Deolasee escribió: > Yeah, I had brought up similar idea up thread. Right now wal_level is > nicely ordered. But with this additional logic, I am not sure if we would > need multiple new levels and also break that ordering (I don't know if its > important). For example, one may want to set up streaming replication > with/without this feature or hot standby with/without the feature. I don't > have a good idea about how to capture them in wal_level. May be something > like: minimal, archive, archive_with_this_new_feature, hot_standby and > hot_standby_with_this_new_feature. That's confusing. A separate GUC sounds better. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 10/24/2013 04:15 AM, Pavan Deolasee wrote: > If we do what you are suggesting, it seems like a single line patch to me. > In XLogSaveBufferForHint(), we probably need to look at this additional GUC > to decide whether or not to backup the block. Wait, what? Why are we having an additional GUC? I'm opposed to the idea of having a GUC to enable failback. When would anyone using replication ever want to disable that? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 24.10.2013 20:39, Josh Berkus wrote: > On 10/24/2013 04:15 AM, Pavan Deolasee wrote: >> If we do what you are suggesting, it seems like a single line patch to me. >> In XLogSaveBufferForHint(), we probably need to look at this additional GUC >> to decide whether or not to backup the block. > > Wait, what? Why are we having an additional GUC? > > I'm opposed to the idea of having a GUC to enable failback. When would > anyone using replication ever want to disable that? For example, if you're not replicating for high availability purposes, but to keep a reporting standby up-to-date. - Heikki
On 10/24/2013 11:12 AM, Heikki Linnakangas wrote: > On 24.10.2013 20:39, Josh Berkus wrote: >> On 10/24/2013 04:15 AM, Pavan Deolasee wrote: >>> If we do what you are suggesting, it seems like a single line patch >>> to me. >>> In XLogSaveBufferForHint(), we probably need to look at this >>> additional GUC >>> to decide whether or not to backup the block. >> >> Wait, what? Why are we having an additional GUC? >> >> I'm opposed to the idea of having a GUC to enable failback. When would >> anyone using replication ever want to disable that? > > For example, if you're not replicating for high availability purposes, > but to keep a reporting standby up-to-date. What kind of overhead are we talking about here? You probably said, but I've had a mail client meltdown and lost a lot of my -hackers emails. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 24.10.2013 23:07, Josh Berkus wrote: > On 10/24/2013 11:12 AM, Heikki Linnakangas wrote: >> On 24.10.2013 20:39, Josh Berkus wrote: >>> On 10/24/2013 04:15 AM, Pavan Deolasee wrote: >>>> If we do what you are suggesting, it seems like a single line patch >>>> to me. >>>> In XLogSaveBufferForHint(), we probably need to look at this >>>> additional GUC >>>> to decide whether or not to backup the block. >>> >>> Wait, what? Why are we having an additional GUC? >>> >>> I'm opposed to the idea of having a GUC to enable failback. When would >>> anyone using replication ever want to disable that? >> >> For example, if you're not replicating for high availability purposes, >> but to keep a reporting standby up-to-date. > > What kind of overhead are we talking about here? You probably said, but > I've had a mail client meltdown and lost a lot of my -hackers emails. One extra WAL record whenever a hint bit is set on a page, for the first time after a checkpoint. In other words, a WAL record needs to be written in the same circumstances as with page checksums, but the WAL records are much smaller as they don't need to contain a full page image, just the block number of the changed block. Or maybe we'll write the full page image after all, like with page checksums, just without calculating the checksums. It might be tricky to skip the full-page image, because then a subsequent change of the page (which isn't just a hint-bit update) needs to somehow know it needs to take a full page image even though a WAL record for it was already written. - Heikki
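The "tricky" part Heikki mentions comes from the usual rule for deciding whether a full-page image is needed, which is roughly the following check (a paraphrase of the XLogCheckBuffer() logic in xlog.c, simplified here for illustration):

    /*
     * A page must be backed up the first time it is modified after a
     * checkpoint, i.e. while its LSN does not yet point past the redo
     * pointer of the current checkpoint. (Paraphrase, not an exact copy.)
     */
    static bool
    NeedsFullPageImage(Page page, XLogRecPtr RedoRecPtr)
    {
        return PageGetLSN(page) <= RedoRecPtr;
    }

If a hint-bit-only record advanced the page LSN without actually containing a full-page image, this test would later claim that no image is needed for the next real change to the page, which is why skipping the image is not as easy as it sounds.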
On 10/24/2013 01:14 PM, Heikki Linnakangas wrote: > One extra WAL record whenever a hint bit is set on a page, for the first > time after a checkpoint. In other words, a WAL record needs to be > written in the same circumstances as with page checksums, but the WAL > records are much smaller as they don't need to contain a full page > image, just the block number of the changed block. > > Or maybe we'll write the full page image after all, like with page > checksums, just without calculating the checksums. It might be tricky to > skip the full-page image, because then a subsequent change of the page > (which isn't just a hint-bit update) needs to somehow know it needs to > take a full page image even though a WAL record for it was already written. I think it would be worth estimating what this actually looks like in terms of log write quantity. My inclination is to say that if it increases log writes by less than 10%, we don't need to provide an option to turn it off. The reasons I don't want to provide a disabling GUC are: a) more GUCs b) confusing users c) causing users to disable rewind *until they need it*, at which point it's too late to enable it. So if there's any way we can avoid having a GUC for this, I'm for it. And if we do have a GUC, failback should be enabled by default. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 10/24/2013 01:14 PM, Heikki Linnakangas wrote: >> One extra WAL record whenever a hint bit is set on a page, for the first >> time after a checkpoint. In other words, a WAL record needs to be >> written in the same circumstances as with page checksums, but the WAL >> records are much smaller as they don't need to contain a full page >> image, just the block number of the changed block. >> >> Or maybe we'll write the full page image after all, like with page >> checksums, just without calculating the checksums. It might be tricky to >> skip the full-page image, because then a subsequent change of the page >> (which isn't just a hint-bit update) needs to somehow know it needs to >> take a full page image even though a WAL record for it was already written. > > I think it would be worth estimating what this actually looks like in > terms of log write quantity. My inclication is to say that if it > increases log writes less than 10%, we don't need to provide an option > to turn it off. > > The reasons I don't want to provide a disabling GUC are: > a) more GUCs > b) confusing users > c) causing users to disable rewind *until they need it*, at which point > it's too late to enable it. > > So if there's any way we can avoid having a GUC for this, I'm for it. > And if we do have a GUC, failback should be enabled by default. +1 on the principle. In fact I've been considering suggesting we might want to retire the difference between archive and hot_standby as wal_level, because the difference is usually so small. And the advantage of hot_standby is in almost every case worth it. Even in the archive recovery mode, being able to do pause_at_recovery_target is extremely useful. And as you say in (c) above, many users don't realize that until it's too late. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote: > On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 10/24/2013 01:14 PM, Heikki Linnakangas wrote: >> >> I think it would be worth estimating what this actually looks like in >> terms of log write quantity. My inclication is to say that if it >> increases log writes less than 10%, we don't need to provide an option >> to turn it off. >> >> The reasons I don't want to provide a disabling GUC are: >> a) more GUCs >> b) confusing users >> c) causing users to disable rewind *until they need it*, at which point >> it's too late to enable it. >> >> So if there's any way we can avoid having a GUC for this, I'm for it. >> And if we do have a GUC, failback should be enabled by default. > > +1 on the principle. > > In fact I've been considering suggesting we might want to retire the > difference between archive and hot_standby as wal_level, because the > difference is usually so small. And the advantage of hot_standby is in > almost every case worth it. Even in the archive recovery mode, being > able to do pause_at_recovery_target is extremely useful. And as you > say in (c) above, many users don't realize that until it's too late. > +1. Many users would not realize it until it is too late if we provide it as an additional GUC. And I agree with writing a WAL record that includes the block number of the changed block; I think that writing such a record does not lead to a huge overhead increase. Are those WAL records replicated to the standby server synchronously (when sync replication is configured, of course)? I am concerned that this could add performance overhead to things such as plain SELECTs or autovacuum, especially when the two servers are far apart. Regards, ------- Sawada Masahiko
On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote: > In fact I've been considering suggesting we might want to retire the > difference between archive and hot_standby as wal_level, because the > difference is usually so small. And the advantage of hot_standby is in > almost every case worth it. Even in the archive recovery mode, being > able to do pause_at_recovery_target is extremely useful. And as you > say in (c) above, many users don't realize that until it's too late. +1 on removing archive from wal_level. Having both archive and hot_standby for wal_level is confusing, and if I recall correctly hot_standby and archive have been kept as possible settings only to protect people from bugs that the newly-introduced hot_standby could introduce due to the few WAL records it adds. But it has been a couple of releases since there have been no such bugs, no? -- Michael
On 2013-10-24 13:51:52 -0700, Josh Berkus wrote: > On 10/24/2013 01:14 PM, Heikki Linnakangas wrote: > > One extra WAL record whenever a hint bit is set on a page, for the first > > time after a checkpoint. In other words, a WAL record needs to be > > written in the same circumstances as with page checksums, but the WAL > > records are much smaller as they don't need to contain a full page > > image, just the block number of the changed block. > > > > Or maybe we'll write the full page image after all, like with page > > checksums, just without calculating the checksums. It might be tricky to > > skip the full-page image, because then a subsequent change of the page > > (which isn't just a hint-bit update) needs to somehow know it needs to > > take a full page image even though a WAL record for it was already written. > > I think it would be worth estimating what this actually looks like in > terms of log write quantity. My inclication is to say that if it > increases log writes less than 10%, we don't need to provide an option > to turn it off. It entirely depends on your workload. If it happens to be something like: INSERT INTO table (lots_of_data); CHECKPOINT; SELECT * FROM TABLE; i.e. there's a checkpoint between loading the data and reading it - not exactly all that uncommon - we'll need to log something for every page. That can be rather noticeable. Especially as I think it will be rather hard to log anything but a real FPI. I really don't think everyone will want this. I am absolutely not against providing an option to log enough information to make pg_rewind work, but I think providing a command to do *safe* *planned* failover will help in many more cases. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-24 22:57:29 +0200, Magnus Hagander wrote: > In fact I've been considering suggesting we might want to retire the > difference between archive and hot_standby as wal_level, because the > difference is usually so small. And the advantage of hot_standby is in > almost every case worth it. Even in the archive recovery mode, being > able to do pause_at_recovery_target is extremely useful. And as you > say in (c) above, many users don't realize that until it's too late. +1. On 2013-10-25 15:16:30 +0900, Michael Paquier wrote: > But it has been a couple of releases since there have been no such > bugs, no? One 'no' too much? Anyway, I think there have been more recent ones, but it's infrequent enough that we can remove the level anyway. FWIW, I've wondered if we shouldn't remove most of the EnableHotStandby checks in xlog.c. There are way too many difference how StartupXLOG behaves depending on HS. E.g. I quite dislike that we do stuff like StartupCLOG at entirely different times during recovery. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Oct 25, 2013 at 8:08 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-10-24 13:51:52 -0700, Josh Berkus wrote: > > It entirely depends on your workload. If it happens to be something > like: > INSERT INTO table (lots_of_data); > CHECKPOINT; > SELECT * FROM TABLE; > > i.e. there's a checkpoint between loading the data and reading it - not > exactly all that uncommon - we'll need to log something for every > page. That can be rather noticeable. Especially as I think it will be > rather hard to log anything but a real FPI. > > I really don't think everyone will want this. I am absolutely not > against providing an option to log enough information to make pg_rewind > work, but I think providing a command to do *safe* *planned* failover > will help in many more. > I think it is better to provide an option to log enough information, such as a new wal_level. For the case where the user doesn't realize it until it's too late, is such information contained in the checkpoint record? For example, if the checkpoint record contained the wal_level, we could use that information to inform the user. BTW, is this information useful only for pg_rewind, or is it there for anything else? (Sorry, it might have already been discussed.) Regards, ------- Sawada Masahiko
On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote: > On 24.10.2013 23:07, Josh Berkus wrote: > >On 10/24/2013 11:12 AM, Heikki Linnakangas wrote: > >>On 24.10.2013 20:39, Josh Berkus wrote: > >>>On 10/24/2013 04:15 AM, Pavan Deolasee wrote: > >>>>If we do what you are suggesting, it seems like a single line patch > >>>>to me. > >>>>In XLogSaveBufferForHint(), we probably need to look at this > >>>>additional GUC > >>>>to decide whether or not to backup the block. > >>> > >>>Wait, what? Why are we having an additional GUC? > >>> > >>>I'm opposed to the idea of having a GUC to enable failback. When would > >>>anyone using replication ever want to disable that? > >> > >>For example, if you're not replicating for high availability purposes, > >>but to keep a reporting standby up-to-date. > > > >What kind of overhead are we talking about here? You probably said, but > >I've had a mail client meltdown and lost a lot of my -hackers emails. > > One extra WAL record whenever a hint bit is set on a page, for the > first time after a checkpoint. In other words, a WAL record needs to > be written in the same circumstances as with page checksums, but the > WAL records are much smaller as they don't need to contain a full > page image, just the block number of the changed block. > > Or maybe we'll write the full page image after all, like with page > checksums, just without calculating the checksums. It might be > tricky to skip the full-page image, because then a subsequent change > of the page (which isn't just a hint-bit update) needs to somehow > know it needs to take a full page image even though a WAL record for > it was already written. Sorry to be replying late to this, but while I am not worried about the additional WAL volume, does this change require the transaction to now wait for a WAL sync to disk before continuing? I thought that was the down-side to WAL logging hint bits, not the WAL volume itself. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian escribió: > On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote: > > On 24.10.2013 23:07, Josh Berkus wrote: > > >What kind of overhead are we talking about here? > > > > One extra WAL record whenever a hint bit is set on a page, for the > > first time after a checkpoint. In other words, a WAL record needs to > > be written in the same circumstances as with page checksums, but the > > WAL records are much smaller as they don't need to contain a full > > page image, just the block number of the changed block. > > > > Or maybe we'll write the full page image after all, like with page > > checksums, just without calculating the checksums. It might be > > tricky to skip the full-page image, because then a subsequent change > > of the page (which isn't just a hint-bit update) needs to somehow > > know it needs to take a full page image even though a WAL record for > > it was already written. > > Sorry to be replying late to this, but while I am not worried about the > additional WAL volume, does this change require the transaction to now > wait for a WAL sync to disk before continuing? I don't think so. There's extra WAL written, but there's no flush-and-wait until end of transaction (as has always been). > I thought that was the down-side to WAL logging hint bits, not the WAL > volume itself. I don't think this is true either. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote: > But if the transaction would not have otherwise generated WAL (i.e. a > select that did not have to do any HOT pruning, or an update with zero rows > matching the where condition), doesn't it now have to flush and wait when > it would otherwise not? We short circuit that if there's no xid assigned. Check RecordTransactionCommit(). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 11:43:34PM +0100, Andres Freund wrote: > On 2013-11-21 14:40:36 -0800, Jeff Janes wrote: > > But if the transaction would not have otherwise generated WAL (i.e. a > > select that did not have to do any HOT pruning, or an update with zero rows > > matching the where condition), doesn't it now have to flush and wait when > > it would otherwise not? > > We short circuit that if there's no xid assigned. Check > RecordTransactionCommit(). OK, that was my question, now answered. Thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 2014-01-16 09:25:51 -0800, Jeff Janes wrote: > On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2013-11-21 14:40:36 -0800, Jeff Janes wrote: > > > But if the transaction would not have otherwise generated WAL (i.e. a > > > select that did not have to do any HOT pruning, or an update with zero > > rows > > > matching the where condition), doesn't it now have to flush and wait when > > > it would otherwise not? > > > > We short circuit that if there's no xid assigned. Check > > RecordTransactionCommit(). > > > > It looks like that only short-circuits the flush if both there is no xid > assigned, and !wrote_xlog. (line 1054 of xact.c) Hm. Indeed. Why don't we just always use the async commit behaviour for that? I don't really see any significant dangers from doing so? It's also rather odd to use the sync rep mechanisms in such scenarios... The if() really should test markXidCommitted instead of wrote_xlog. > I do see stalls on fdatasync on flush from select statements which had no > xid, but did generate xlog due to HOT pruning, I don't see why WAL logging > hint bits would be different. Are the stalls at commit or while the select is running? If wal_buffers is filled too fast, which can easily happen if loads of pages are hinted and wal logged, that will happen independently from RecordTransactionCommit(). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
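For readers without xact.c open, a simplified paraphrase of the commit-time logic Jeff and Andres are discussing follows. It is a sketch, not an exact copy of RecordTransactionCommit(); the parameterized form (these are locals in the real function) is only for illustration:

    /* Sketch of the relevant part of RecordTransactionCommit(), simplified. */
    static void
    commit_flush_sketch(bool markXidCommitted, bool wrote_xlog,
                        bool forceSyncCommit, int nrels)
    {
        if (!markXidCommitted)
        {
            /* No xid was assigned; if we also wrote no WAL, skip everything. */
            if (!wrote_xlog)
                return;
        }

        /* Decide between the synchronous flush-and-wait path and async commit. */
        if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
            forceSyncCommit || nrels > 0)
        {
            /*
             * Synchronous path: flush WAL before reporting commit. Andres'
             * suggestion is to test markXidCommitted rather than wrote_xlog
             * here, so a transaction that merely emitted hint-bit records
             * (but assigned no xid) could take the cheaper async path.
             */
            XLogFlush(XactLastRecEnd);
        }
        /* else: async-commit behaviour, no immediate flush or sync-rep wait */
    }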
On 2014-01-16 11:01:29 -0800, Jeff Janes wrote: > I think the argument is that drawing the next value from a sequence can > generate xlog that needs to be flushed, but doesn't assign an xid. Then that should assign an xid. Which would yield correct behaviour with async commit where it's currently *not* causing a WAL flush at all unless a page boundary is crossed. I've tried arguing that way before... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Jeff Janes <jeff.janes@gmail.com> writes: > I think the argument is that drawing the next value from a sequence can > generate xlog that needs to be flushed, but doesn't assign an xid. > I would think the sequence should flush that record before it hands out the > value, not before the commit, but... IIRC the argument was that we'd flush WAL before any use of the value could make it to disk. Which is true if you're just inserting it into a table; perhaps less so if the client is doing something external to the database with it. (But it'd be reasonable to say that clients who want a guaranteed-good serial for such purposes should have to commit the transaction that created the value.) regards, tom lane