Thread: Sync Rep v19
Latest version of Sync Rep, which includes substantial internal changes and simplifications from the previous version (25-30 changes). It includes all outstanding technical comments, typo fixes and docs. I will continue to work on self-review and testing myself, though I actively encourage others to test and report issues.

Interesting changes:

* docs updated
* names listed in synchronous_standby_names are now in priority order
* synchronous_standby_names = "*" matches all standby names
* pg_stat_replication now shows standby priority - this is an ordinal number, so "1" means 1st, "2" means 2nd etc., while 0 means "not a sync standby"

The only *currently* outstanding point of discussion is the "when to wait" debate, on which we aren't moving quickly towards consensus at this stage. I see that as a "How should it work?" debate and something we can chew over during Alpha/Beta, not as an immediate blocker to commit.

Please comment on the patch and also watch changes to the repo: git://github.com/simon2ndQuadrant/postgres.git

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
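[For anyone trying the patch, a minimal illustration of how the pieces above fit together, using the GUCs and the pg_stat_replication columns discussed in this thread; the standby names and values below are only examples, not part of the patch itself.]

    # postgresql.conf on the primary (illustrative names/values)
    synchronous_standby_names = 'boston, london'   # listed in priority order; '*' matches any standby
    synchronous_replication = on                   # commits wait for the current sync standby

    -- check which connected standby is currently acting as the sync standby
    SELECT application_name, state, sync_priority, sync_state
      FROM pg_stat_replication;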
On Thu, Mar 3, 2011 at 7:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Latest version of Sync Rep, which includes substantial internal changes > and simplifications from previous version. (25-30 changes). > > Includes all outstanding technical comments, typos and docs. I will > continue to work on self review and test myself, though actively > encourage others to test and report issues. Thanks for the patch! > * synchronous_standby_names = "*" matches all standby names Using '*' as the default seems to lead the performance degradation by being connected from unexpected synchronous standby. > * pg_stat_replication now shows standby priority - this is an ordinal > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync > standby". monitoring.sgml should be updated. Though I've not read whole of the patch yet, here is the current comment: Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication looks fragile. Since they are used also by lwlock, the value of them can be changed unexpectedly. Instead, how about defining dedicated variables for replication? + else if (WaitingForSyncRep) + { + /* + * This must NOT be a FATAL message. We want the state of the + * transaction being aborted to be indeterminate to ensure that + * the transaction completion guarantee is never broken. + */ The backend can reach this code path after returning the commit to the client. Instead, how about doing this in EndCommand, to close the connection before returning the commit? + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); + sync_priority = walsnd->sync_standby_priority; + LWLockRelease(SyncRepLock); LW_SHARE can be used here, instead. + /* + * Wait no longer if we have already reached our LSN + */ + if (XLByteLE(XactCommitLSN, queue->lsn)) + { + /* No need to wait */ + LWLockRelease(SyncRepLock); + return; + } It might take long to acquire SyncRepLock, so how about comparing our LSN with WalSnd->flush before here? replication_timeout_client depends on GetCurrentTransactionStopTimestamp(). In COMMIT case, it's OK. But In PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED cases, it seems problematic because they don't call SetCurrentTransactionStopTimestamp(). In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again after the wake-up from the latch. In this case, the "timeout" should be calculated again. Otherwise, it would take unexpectedly very long to cause the timeout. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote: > > * synchronous_standby_names = "*" matches all standby names > > Using '*' as the default seems to lead the performance degradation by > being connected from unexpected synchronous standby. You can configure it however you wish. It seemed better to have an out of the box setting that was useful. > > * pg_stat_replication now shows standby priority - this is an ordinal > > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync > > standby". > > monitoring.sgml should be updated. Didn't think it needed to be, but I've added a few lines to explain. > Though I've not read whole of the patch yet, here is the current comment: > > Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication > looks fragile. Since they are used also by lwlock, the value of them can be > changed unexpectedly. Instead, how about defining dedicated variables for > replication? Yes, I think the queue stuff needs a rewrite now. > + else if (WaitingForSyncRep) > + { > + /* > + * This must NOT be a FATAL message. We want the state of the > + * transaction being aborted to be indeterminate to ensure that > + * the transaction completion guarantee is never broken. > + */ > > The backend can reach this code path after returning the commit to the client. > Instead, how about doing this in EndCommand, to close the connection before > returning the commit? OK, will look. > + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > + sync_priority = walsnd->sync_standby_priority; > + LWLockRelease(SyncRepLock); > > LW_SHARE can be used here, instead. Seemed easier to keep it simple and have all lockers use LW_EXCLUSIVE. But I've changed it for you. > + /* > + * Wait no longer if we have already reached our LSN > + */ > + if (XLByteLE(XactCommitLSN, queue->lsn)) > + { > + /* No need to wait */ > + LWLockRelease(SyncRepLock); > + return; > + } > > It might take long to acquire SyncRepLock, so how about comparing > our LSN with WalSnd->flush before here? If we're not the sync standby and we need to takeover the role of sync standby we may need to issue a wakeup even though our standby reached that LSN some time before. So we need to check each time. > replication_timeout_client depends on GetCurrentTransactionStopTimestamp(). > In COMMIT case, it's OK. But In PREPARE TRANSACTION, COMMIT PREPARED > and ROLLBACK PREPARED cases, it seems problematic because they don't call > SetCurrentTransactionStopTimestamp(). Shame on them! Seems reasonable that they should call SetCurrentTransactionStopTimestamp(). I don't want to make a special case there for prepared transactions. > In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again > after the wake-up from the latch. In this case, the "timeout" should > be calculated > again. Otherwise, it would take unexpectedly very long to cause the timeout. That was originally modelled on on the way the statement_timeout timer works. If it gets nudged and wakes up too early it puts itself back to sleep to wakeup at the same time again. I've renamed the variables to make that clearer and edited slightly. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote: >> > * synchronous_standby_names = "*" matches all standby names >> >> Using '*' as the default seems to lead the performance degradation by >> being connected from unexpected synchronous standby. > > You can configure it however you wish. It seemed better to have an out > of the box setting that was useful. Well the HBA still needs some opening before anyone can claim to be a standby. I guess the default line would be commented out and no standby would be accepted as synchronous by default, assuming this GUC is sighup. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
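[For reference on the HBA point: even with synchronous_standby_names = '*', nothing can become a standby at all unless pg_hba.conf permits replication connections, along the lines of this illustrative entry.]

    # pg_hba.conf on the primary
    # TYPE  DATABASE     USER        ADDRESS         METHOD
    host    replication  replicator  192.168.0.0/24  md5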
On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote: > >> > * synchronous_standby_names = "*" matches all standby names > >> > >> Using '*' as the default seems to lead the performance degradation by > >> being connected from unexpected synchronous standby. > > > > You can configure it however you wish. It seemed better to have an out > > of the box setting that was useful. > > Well the HBA still needs some opening before anyone can claim to be a > standby. I guess the default line would be commented out and no standby > would be accepted as synchronous by default, assuming this GUC is sighup. The patch sets "*" as the default, so all standbys are synchronous by default. Would you prefer it if it was blank, meaning no standbys are synchronous, by default? -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Mar 3, 2011 at 1:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >> > On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote: >> >> > * synchronous_standby_names = "*" matches all standby names >> >> >> >> Using '*' as the default seems to lead the performance degradation by >> >> being connected from unexpected synchronous standby. >> > >> > You can configure it however you wish. It seemed better to have an out >> > of the box setting that was useful. >> >> Well the HBA still needs some opening before anyone can claim to be a >> standby. I guess the default line would be commented out and no standby >> would be accepted as synchronous by default, assuming this GUC is sighup. > > The patch sets "*" as the default, so all standbys are synchronous by > default. > > Would you prefer it if it was blank, meaning no standbys are > synchronous, by default? I think * is a reasonable default. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote: > + else if (WaitingForSyncRep) > + { > + /* > + * This must NOT be a FATAL message. We want > the state of the > + * transaction being aborted to be > indeterminate to ensure that > + * the transaction completion guarantee is > never broken. > + */ > > The backend can reach this code path after returning the commit to the > client. > Instead, how about doing this in EndCommand, to close the connection > before > returning the commit? I don't really understand this comment. You can't get there after returning the COMMIT message. Once we have finished waiting we set WaitingForSyncRep = false, before we return to RecordTransactionCommit() and continue from there. Anyway, this is code in the interrupt handler and only gets executed when we receive SIGTERM for a fast shutdown. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-03 11:53, Simon Riggs wrote: > Latest version of Sync Rep, which includes substantial internal changes > and simplifications from previous version. (25-30 changes). > > Includes all outstanding technical comments, typos and docs. I will > continue to work on self review and test myself, though actively > encourage others to test and report issues. > > Interesting changes > > * docs updated > > * names listed in synchronous_standby_names are now in priority order > > * synchronous_standby_names = "*" matches all standby names > > * pg_stat_replication now shows standby priority - this is an ordinal > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync > standby". Some initial remarks: 1) this works nice: application_name not in synchronous_standby_names -> sync_priority = 0 (OK) change synchronous_standby_names to default *, reload conf -> sync_priority = 1 (OK) message in log file LOG: 00000: standby "walreceiver" is now the synchronous standby with priority 1 2) priorities I have to get used to mapping the integers to synchronous replication meaning. 0 -> asynchronous 1 -> the synchronous standby that is waited for 2 and higher -> potential syncs Could it be hidden from the user? I liked asynchronous / synchronous / potential synchronous then the log message could be LOG: 00000: standby "walreceiver" is now the synchronous standby 3) walreceiver is the default application name - could there be problems when a second standby with that name connects (ofcourse the same question holds for two the same nondefault application_names)? regards Yeb Havinga
On Thu, 2011-03-03 at 22:27 +0100, Yeb Havinga wrote: > On 2011-03-03 11:53, Simon Riggs wrote: > > Latest version of Sync Rep, which includes substantial internal changes > > and simplifications from previous version. (25-30 changes). > > > > Includes all outstanding technical comments, typos and docs. I will > > continue to work on self review and test myself, though actively > > encourage others to test and report issues. > > > > Interesting changes > > > > * docs updated > > > > * names listed in synchronous_standby_names are now in priority order > > > > * synchronous_standby_names = "*" matches all standby names > > > > * pg_stat_replication now shows standby priority - this is an ordinal > > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync > > standby". > Some initial remarks: > > 1) this works nice: > application_name not in synchronous_standby_names -> sync_priority = 0 (OK) > change synchronous_standby_names to default *, reload conf -> > sync_priority = 1 (OK) > > message in log file > LOG: 00000: standby "walreceiver" is now the synchronous standby with > priority 1 > > 2) priorities > I have to get used to mapping the integers to synchronous replication > meaning. > 0 -> asynchronous > 1 -> the synchronous standby that is waited for > 2 and higher -> potential syncs > > Could it be hidden from the user? I liked asynchronous / synchronous / > potential synchronous Yes, that sounds good. I will leave it as it is now to gain other comments since this need not delay commit. > then the log message could be > LOG: 00000: standby "walreceiver" is now the synchronous standby The priority is mentioned in the LOG message, so you can understand what happens when multiple standbys connect. e.g. if you have synchronous_standby_names = 'a, b, c' and then the standbys connect in the order b, c, a then you will see log messages LOG: standby "b" is now the synchronous standby with priority 2 LOG: standby "a" is now the synchronous standby with priority 1 It's designed so no matter which order standbys arrive in it is the highest priority standby that makes it to the front in the end. > 3) walreceiver is the default application name - could there be problems > when a second standby with that name connects (ofcourse the same > question holds for two the same nondefault application_names)? That's documented: in that case which standby is sync is indeterminate. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
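[For context, the name the primary matches against synchronous_standby_names is the application_name the standby reports when it connects, so giving each standby a distinct name avoids the indeterminate case above. A sketch of a standby's configuration, with illustrative values:]

    # recovery.conf on each standby
    standby_mode = 'on'
    primary_conninfo = 'host=primary port=5432 user=replicator application_name=standby1'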
Simon Riggs <simon@2ndQuadrant.com> writes: > Anyway, this is code in the interrupt handler and only gets executed > when we receive SIGTERM for a fast shutdown. I trust it's not getting *directly* executed from the interrupt handler, at least not without ImmediateInterruptOK. regards, tom lane
On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> Anyway, this is code in the interrupt handler and only gets executed >> when we receive SIGTERM for a fast shutdown. > > I trust it's not getting *directly* executed from the interrupt handler, > at least not without ImmediateInterruptOK. Yes, the backend waits for replication while cancel/die interrupt is being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't lead the waiting backend to there directly. The backend reaches there after returning the result. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> Anyway, this is code in the interrupt handler and only gets executed >>> when we receive SIGTERM for a fast shutdown. >> >> I trust it's not getting *directly* executed from the interrupt handler, >> at least not without ImmediateInterruptOK. > > Yes, the backend waits for replication while cancel/die interrupt is > being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't > lead the waiting backend to there directly. The backend reaches there > after returning the result. BTW, this is true in COMMIT and PREPARE cases, and false in COMMIT PREPARED and ROLLBACK PREPARED cases. In the latter cases, HOLD_INTERRUPT() is not called before waiting for replication. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2011-03-04 at 13:35 +0900, Fujii Masao wrote: > On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Simon Riggs <simon@2ndQuadrant.com> writes: > >>> Anyway, this is code in the interrupt handler and only gets executed > >>> when we receive SIGTERM for a fast shutdown. > >> > >> I trust it's not getting *directly* executed from the interrupt handler, > >> at least not without ImmediateInterruptOK. > > > > Yes, the backend waits for replication while cancel/die interrupt is > > being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't > > lead the waiting backend to there directly. The backend reaches there > > after returning the result. > > BTW, this is true in COMMIT and PREPARE cases, CommitTransaction() calls HOLD_INTERRUPT() and then RESUME_INTERRUPTS(), which was reasonable before we started waiting for syncrep. The interrupt does occur *before* we send the message back, but doesn't work effectively at interrupting the wait in the way you would like. If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that would allow all signals not just SIGTERM. We would need to selectively reject everything except SIGTERM messages. Ideas? Alter ProcessInterrupts() to accept an interrupt if ProcDiePending && WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little scary, but looks like it will work. > and false in > COMMIT PREPARED and ROLLBACK PREPARED cases. In the > latter cases, HOLD_INTERRUPT() is not called before waiting for > replication. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
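[A rough sketch of the ProcessInterrupts() change Simon describes: let a pending die interrupt through while the backend is blocked on sync rep, even though interrupts are nominally held off. This only illustrates the idea, not the committed code; WaitingForSyncRep is the flag from the patch.]

    /* at the top of ProcessInterrupts(), sketch only */
    if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
    {
        /*
         * Normally we must not service interrupts here, but make an
         * exception for a die interrupt while waiting for sync rep,
         * so a fast shutdown (SIGTERM) can end the wait.
         */
        if (!(ProcDiePending && WaitingForSyncRep))
            return;
    }
    /* ... existing die/cancel handling continues below ... */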
On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Though I've not read whole of the patch yet, here is the current comment: Here are another comments: +#replication_timeout_client = 120 # 0 means wait forever Typo: s/replication_timeout_client/sync_replication_timeout + else if (timeout > 0 && + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(), + wait_start, timeout)) If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case), the return value of GetCurrentTransactionStopTimestamp() is the same as "wait_start". So, in this case, the timeout never expires. + strcpy(new_status + len, " waiting for sync rep"); + set_ps_display(new_status, false); How about changing the message to something like "waiting for %X/%X" (%X/%X indicates the LSN which the backend is waiting for)? Please initialize MyProc->procWaitLink to NULL in InitProcess() as well as do MyProc->lwWaitLink. + /* + * We're a potential sync standby. Release waiters if we are the + * highest priority standby. We do this even if the standby is not yet + * caught up, in case this is a restart situation and + * there are backends waiting for us. That allows backends to exit the + * wait state even if new backends cannot yet enter the wait state. + */ I don't think that it's good idea to switch the high priority standby which has not caught up, to the sync one, especially when there is already another sync standby. Because that degrades replication from sync to async for a while, even though there is sync standby which has caught up. + if (walsnd->pid != 0 && + walsnd->sync_standby_priority > 0 && + (priority == 0 || + priority < walsnd->sync_standby_priority)) + { + priority = walsnd->sync_standby_priority; + syncWalSnd = walsnd; + } According to the code, the last named standby has highest priority. But the document says the opposite. ISTM the waiting backends can be sent the wake-up signal by the walsender multiple times since the walsender doesn't remove any entry from the queue. Isn't this unsafe? waste of the cycle? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > CommitTransaction() calls HOLD_INTERRUPT() and then RESUME_INTERRUPTS(), > which was reasonable before we started waiting for syncrep. The > interrupt does occur *before* we send the message back, but doesn't work > effectively at interrupting the wait in the way you would like. > > If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that > would allow all signals not just SIGTERM. We would need to selectively > reject everything except SIGTERM messages. > > Ideas? > > Alter ProcessInterrupts() to accept an interrupt if ProcDiePending && > WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little > scary, but looks like it will work. If shutdown is requested before WaitingForSyncRep is set to TRUE and after HOLD_INTERRUPT() is called, the waiting backends cannot be interrupted. SIGTERM can be sent by pg_terminate_backend(). So we should check whether shutdown is requested before emitting WARNING and closing the connection. If it's not requested yet, I think that it's safe to return the success indication to the client. I think that it's safer to close the connection and terminate the backend after cleaning all the resources. So, as I suggested before, how about doing that in EndCommand()? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2011-03-04 at 17:34 +0900, Fujii Masao wrote: > On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > CommitTransaction() calls HOLD_INTERRUPT() and then RESUME_INTERRUPTS(), > > which was reasonable before we started waiting for syncrep. The > > interrupt does occur *before* we send the message back, but doesn't work > > effectively at interrupting the wait in the way you would like. > > > > If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that > > would allow all signals not just SIGTERM. We would need to selectively > > reject everything except SIGTERM messages. > > > > Ideas? > > > > Alter ProcessInterrupts() to accept an interrupt if ProcDiePending && > > WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little > > scary, but looks like it will work. > > If shutdown is requested before WaitingForSyncRep is set to TRUE and > after HOLD_INTERRUPT() is called, the waiting backends cannot be > interrupted. > > SIGTERM can be sent by pg_terminate_backend(). So we should check > whether shutdown is requested before emitting WARNING and closing > the connection. If it's not requested yet, I think that it's safe to return the > success indication to the client. I'm not sure if that matters. Nobody apart from the postmaster knows about a shutdown. All the other processes know is that they received SIGTERM, which as you say could have been a specific user action aimed at an individual process. We need a way to end the wait state explicitly, so it seems easier to make SIGTERM the initiating action, no matter how it is received. The alternative is to handle it this way 1) set something in shared memory 2) set latch of all backends 3) have the backends read shared memory and then end the wait Who would do (1) and (2)? Not the backend, its sleeping, not the postmaster its shm, nor a WALSender cos it might not be there. Seems like a lot of effort to avoid SIGTERM. Do we have a good reason why we need that? Might it introduce other issues? > I think that it's safer to close the connection and terminate the backend > after cleaning all the resources. So, as I suggested before, how about > doing that in EndCommand()? Yes, if we don't use SIGTERM then we would use EndCommand() -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2011-03-04 at 16:42 +0900, Fujii Masao wrote: > On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > Though I've not read whole of the patch yet, here is the current comment: > > Here are another comments: > > +#replication_timeout_client = 120 # 0 means wait forever > > Typo: s/replication_timeout_client/sync_replication_timeout Done > + else if (timeout > 0 && > + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(), > + wait_start, timeout)) > > If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case), > the return value of GetCurrentTransactionStopTimestamp() is the same as > "wait_start". So, in this case, the timeout never expires. Don't understand (still) > + strcpy(new_status + len, " waiting for sync rep"); > + set_ps_display(new_status, false); > > How about changing the message to something like "waiting for %X/%X" > (%X/%X indicates the LSN which the backend is waiting for)? Done > Please initialize MyProc->procWaitLink to NULL in InitProcess() as well as > do MyProc->lwWaitLink. I'm rewriting that aspect now. > + /* > + * We're a potential sync standby. Release waiters if we are the > + * highest priority standby. We do this even if the standby is not yet > + * caught up, in case this is a restart situation and > + * there are backends waiting for us. That allows backends to exit the > + * wait state even if new backends cannot yet enter the wait state. > + */ > > I don't think that it's good idea to switch the high priority standby which has > not caught up, to the sync one, especially when there is already another > sync standby. Because that degrades replication from sync to async for > a while, even though there is sync standby which has caught up. OK, that wasn't really my intention. Changed. > + if (walsnd->pid != 0 && > + walsnd->sync_standby_priority > 0 && > + (priority == 0 || > + priority < walsnd->sync_standby_priority)) > + { > + priority = walsnd->sync_standby_priority; > + syncWalSnd = walsnd; > + } > > According to the code, the last named standby has highest priority. But the > document says the opposite. Priority is a difficult word here since "1" is the highest priority. I deliberately avoided using the word "highest" in the code for that reason. The code above finds the lowest non-zero standby, which is correct as documented. > ISTM the waiting backends can be sent the wake-up signal by the > walsender multiple times since the walsender doesn't remove any > entry from the queue. Isn't this unsafe? waste of the cycle? It's ok to set a latch that isn't set. It's unlikely to wake someone twice before they can remove themselves. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2011-03-04 at 10:51 +0000, Simon Riggs wrote: > > + else if (timeout > 0 && > > + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(), > > + wait_start, timeout)) > > > > If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case), > > the return value of GetCurrentTransactionStopTimestamp() is the same as > > "wait_start". So, in this case, the timeout never expires. > > Don't understand (still) OK, coffee has seeped into brain now, thanks. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
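[The problem being acknowledged here is that the quoted test measures the interval between the transaction stop timestamp and wait_start, which are effectively the same instant, so the timeout can never fire. A sketch of a corrected test, illustrative rather than the committed code:]

    /* has the client timeout elapsed since we started waiting? */
    if (timeout > 0 &&
        TimestampDifferenceExceeds(wait_start, GetCurrentTimestamp(), timeout))
    {
        /* stop waiting for the standby and warn the client */
    }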
On 2011-03-03 11:53, Simon Riggs wrote: > Latest version of Sync Rep, which includes substantial internal changes > and simplifications from previous version. (25-30 changes). Testing more with the post v19 version from github with HEAD commit 009875662e1b47012e1f4b7d30eb9e238d1937f6 Author: Simon Riggs <simon@2ndquadrant.com> Date: Fri Mar 4 06:13:43 2011 +0000 Allow SIGTERM messages in ProcessInterrupts() even when interrupts are held, if WaitingForSyncRep 1) unexpected behaviour - master has synchronous_standby_names = 'standby1,standby2,standby3' - standby with 'standby2' connects first. - LOG: 00000: standby "standby2" is now the synchronous standby with priority 2 I'm still confused by the priority numbers. At first I thought that priority 1 meant: this is the one that is currently waited for. Now I'm not sure if this is the first potential standby that is not used, or that it is actually the one waited for. What I expected was that it would be connected with priority 1. And then if the standby1 connect, it would become the one with prio1 and standby2 with prio2. 2) unexpected behaviour - continued from above - standby with 'asyncone' name connects next - no log message on master I expected a log message along the lines 'standby "asyncone" is now an asynchronous standby' 3) more about log messages - didn't get a log message that the asyncone standby stopped - didn't get a log message that standby1 connected with priority 1 - after stop / start master, again only got a log that standby2 connectied with priority 2 - pg_stat_replication showed both standb1 and standby2 with correct prio# 4) More about the priority stuff. At this point I figured out prio 2 can also be 'the real sync'. Still I'd prefer in pg_stat_replication a boolean that clearly shows 'this is the one', with a source that is intimately connected to the syncrep implemenation, instead of a different implementation of 'if lowest connected priority and > 0, then sync is true. If there are two different implementations, there is room for differences, which doesn't feel right. 5) performance. Seems to have dropped a a few dozen %. With v17 I earlier got ~650 tps and after some more tuning over 900 tps. Now with roughly the same setup I get ~ 550 tps. Both versions on the same hardware, both compiled without debugging, and I used the same postgresql.conf start config. I'm currently thinking about a failure test that would check if a commit has really waited for the standby. What's the worst thing to do to a master server? Ideas are welcome :-) #!/bin/sh psql -c "create a big table with generate_series" echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger regards, Yeb Havinga
On Fri, 2011-03-04 at 12:24 +0100, Yeb Havinga wrote: > On 2011-03-03 11:53, Simon Riggs wrote: > > Latest version of Sync Rep, which includes substantial internal changes > > and simplifications from previous version. (25-30 changes). > Testing more with the post v19 version from github with HEAD Thanks > commit 009875662e1b47012e1f4b7d30eb9e238d1937f6 > Author: Simon Riggs <simon@2ndquadrant.com> > Date: Fri Mar 4 06:13:43 2011 +0000 > > Allow SIGTERM messages in ProcessInterrupts() even when interrupts are > held, if WaitingForSyncRep > > > 1) unexpected behaviour > - master has synchronous_standby_names = 'standby1,standby2,standby3' > - standby with 'standby2' connects first. > - LOG: 00000: standby "standby2" is now the synchronous standby with > priority 2 > > I'm still confused by the priority numbers. At first I thought that > priority 1 meant: this is the one that is currently waited for. Now I'm > not sure if this is the first potential standby that is not used, or > that it is actually the one waited for. > What I expected was that it would be connected with priority 1. And then > if the standby1 connect, it would become the one with prio1 and standby2 > with prio2. The priority refers to the order in which that standby is listed in synchronous_standby_names. That is not dependent upon who is currently connected. It doesn't mean the order in which the currently connected standbys will become the sync standby. So the log message allows you to work out that "standby2" is connected and will operate as sync standby until something mentioned earlier in synchronous_standby_names, in this case standby1, connects. > 2) unexpected behaviour > - continued from above > - standby with 'asyncone' name connects next > - no log message on master > > I expected a log message along the lines 'standby "asyncone" is now an > asynchronous standby' That would introduce messages where there currently aren't any, so I left that out. I'll put it in for clarity. > 3) more about log messages > - didn't get a log message that the asyncone standby stopped OK > - didn't get a log message that standby1 connected with priority 1 Bad > - after stop / start master, again only got a log that standby2 > connectied with priority 2 Bad > - pg_stat_replication showed both standb1 and standby2 with correct prio# Good Please send me log output at DEBUG3 offline. > 4) More about the priority stuff. At this point I figured out prio 2 can > also be 'the real sync'. Still I'd prefer in pg_stat_replication a > boolean that clearly shows 'this is the one', with a source that is > intimately connected to the syncrep implemenation, instead of a > different implementation of 'if lowest connected priority and > 0, then > sync is true. If there are two different implementations, there is room > for differences, which doesn't feel right. OK > 5) performance. > Seems to have dropped a a few dozen %. With v17 I earlier got ~650 tps > and after some more tuning over 900 tps. Now with roughly the same setup > I get ~ 550 tps. Both versions on the same hardware, both compiled > without debugging, and I used the same postgresql.conf start config. Will need to re-look at performance after commit > I'm currently thinking about a failure test that would check if a commit > has really waited for the standby. What's the worst thing to do to a master server? > Ideas are welcome :-) > > #!/bin/sh > psql -c "create a big table with generate_series" > echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger > > regards, > Yeb Havinga > -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-04 12:24, Yeb Havinga wrote: > I'm currently thinking about a failure test that would check if a > commit has really waited for the standby. What's the worst thing to do > to a master server? Ideas are welcome :-) > > #!/bin/sh > psql -c "create a big table with generate_series" > echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger Did that with both a sync and async standby server, then promoted both replicas. Both replicas had the complete big table. Maybe the async server was somehow 'saved' by the master waiting for the sync server? Test repeated with only the async one connected. The master then shows this at restart LOG: 00000: record with zero length at 4/B2CD3598 LOG: 00000: redo done at 4/B2CD3558 LOG: 00000: last completed transaction was at log time 2011-03-04 14:43:31.02041+01 The async promoted server LOG: 00000: record with zero length at 4/B2CC9260 LOG: 00000: redo done at 4/B2CC9220 LOG: 00000: last completed transaction was at log time 2011-03-04 14:43:31.018444+01 Even though the async server had the complete relation I created, something was apparently done just before the reboot. Test repeated with only 1 sync standby Then on master at recovery LOG: 00000: record with zero length at 4/D1051C88 LOG: 00000: redo done at 4/D1051C48 LOG: 00000: last completed transaction was at log time 2011-03-04 14:52:11.035188+01 on the sync promoted server LOG: 00000: redo done at 4/D1051C48 LOG: 00000: last completed transaction was at log time 2011-03-04 14:52:11.035188+01 Nice! regards, Yeb Havinga
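[A sketch of the kind of crash test being run here, assuming the standby's recovery.conf has a trigger_file configured for promotion; hostnames and paths are purely illustrative.]

    #!/bin/sh
    # commit something on the master, then hard-reset it immediately
    psql -h master -c "CREATE TABLE big AS SELECT generate_series(1,1000000) AS i"
    ssh root@master 'echo 1 > /proc/sys/kernel/sysrq; echo b > /proc/sysrq-trigger'
    # promote the standby via its trigger file, then compare the
    # "last completed transaction" timestamps in both servers' logs
    ssh standby 'touch /var/lib/pgsql/promote.trigger'
    ssh standby 'grep "last completed transaction" /var/lib/pgsql/data/pg_log/*.log'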
On Fri, Mar 4, 2011 at 7:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> +       if (walsnd->pid != 0 &&
>> +           walsnd->sync_standby_priority > 0 &&
>> +           (priority == 0 ||
>> +            priority < walsnd->sync_standby_priority))
>> +       {
>> +           priority = walsnd->sync_standby_priority;
>> +           syncWalSnd = walsnd;
>> +       }
>>
>> According to the code, the last named standby has highest priority. But the
>> document says the opposite.
>
> Priority is a difficult word here since "1" is the highest priority. I
> deliberately avoided using the word "highest" in the code for that
> reason.
>
> The code above finds the lowest non-zero standby, which is correct as
> documented.

Hmm.. that seems to find the highest standby. And, I could confirm that in my box. Please see the following. The priority (= 2) of synchronous standby (its sync_state is SYNC) is higher than that (= 1) of potential one (its sync_state is POTENTIAL).

    postgres=# SHOW synchronous_standby_names;
     synchronous_standby_names
    ----------------------------
     one, two
    (1 row)

    postgres=# SELECT application_name, state, sync_priority, sync_state FROM pg_stat_replication;
     application_name |   state   | sync_priority | sync_state
    ------------------+-----------+---------------+------------
     one              | STREAMING |             1 | POTENTIAL
     two              | STREAMING |             2 | SYNC
    (2 rows)

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: > postgres=# SELECT application_name, state, sync_priority, sync_state > FROM pg_stat_replication; > application_name | state | sync_priority | sync_state > ------------------+-----------+---------------+------------ > one | STREAMING | 1 | POTENTIAL > two | STREAMING | 2 | SYNC > (2 rows) Bug! Thanks. Fixed -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
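[The loop quoted above keeps a candidate whenever its priority number is greater than the current one, so the last-listed standby wins, which matches the behaviour observed above. Presumably the fix is simply to flip the comparison so the lowest non-zero priority number, i.e. the first name in the list, is chosen; a sketch rather than the committed code:]

    if (walsnd->pid != 0 &&
        walsnd->sync_standby_priority > 0 &&
        (priority == 0 ||
         priority > walsnd->sync_standby_priority))   /* was '<' in v19 */
    {
        priority = walsnd->sync_standby_priority;
        syncWalSnd = walsnd;
    }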
On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> SIGTERM can be sent by pg_terminate_backend(). So we should check >> whether shutdown is requested before emitting WARNING and closing >> the connection. If it's not requested yet, I think that it's safe to return the >> success indication to the client. > > I'm not sure if that matters. Nobody apart from the postmaster knows > about a shutdown. All the other processes know is that they received > SIGTERM, which as you say could have been a specific user action aimed > at an individual process. > > We need a way to end the wait state explicitly, so it seems easier to > make SIGTERM the initiating action, no matter how it is received. > > The alternative is to handle it this way > 1) set something in shared memory > 2) set latch of all backends > 3) have the backends read shared memory and then end the wait > > Who would do (1) and (2)? Not the backend, its sleeping, not the > postmaster its shm, nor a WALSender cos it might not be there. > > Seems like a lot of effort to avoid SIGTERM. Do we have a good reason > why we need that? Might it introduce other issues? On the second thought... I was totally wrong. Preventing the backend from returning the commit when shutdown is requested doesn't help to avoid the data loss at all. Without shutdown, the following simple scenario can cause data loss. 1. Replication connection is closed because of network outage. 2. Though replication has not been completed, the waiting backend is released since the timeout expires. Then it returns the success to the client. 3. The primary crashes, and then the clusterware promotes the standby which doesn't have the latest change on the primary to new primary. Data loss happens! In the first place, there are two kinds of data loss: (A) Physical data loss This is the case where we can never retrieve the committed data physically. For example, if the storage of the standalone server gets corrupted, we would lose some data forever. To avoid this type of data loss, we would have to choose the "wait-forever" behavior. But as I said in upthread, we can decrease the risk of this data loss to a certain extent by spending much money on the storage. So, if that cost is less than the cost which we have to pay when down-time happens, we don't need to choose the "wait-forever" option. (B) Logical data loss This is the case where we think wrongly that the committed data has been lost while we can actually retrieve it physically. For example, in the above three-steps scenario, we can read all the committed data from two servers physically even after failover. But since the client attempts to read data only from new primary, some data looks lost to the client. The "wait-forever" behavior can help also to avoid this type of data loss. And, another way is to STONITH the standby before the timeout releases any waiting backend.
Since the availability of the database service decreases very much, I don't want to use that. Keeping transactions waiting in the latter two cases would be required to avoid (A), but not (B). So I think that we can relax the "wait-forever" option so that it allows not-replicated transactions to complete only in those cases. IOW, when we initially start the primary, the backends don't wait at all for new standby to connect. And, while new primary is running alone after failover, the backends don't wait at all, too. Only when replication connection is closed while streaming WAL to sync standby, the backends wait until new sync standby has connected and replication has been completed. Even in this case, if we want to improve the service availability, we have only to make something like dopd to STONITH the outdated standby, and then request the primary to release the waiting backends. So I think that the interface to request that release should be implemented. Fortunately, that partial "wait-forever" behavior has already been implemented in Simon's patch with the client timeout = 0 (disable). If he implements the interface to release the waiting backends, I'm OK with his design about when to release the backends for 9.1 (unless I'm missing something). Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Mar 4, 2011 at 3:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > The "wait-forever" might be a straightforward approach against (B). But > this option prevents transactions from running not only when the > synchronous standby goes away, but also when the primary is invoked > first or when the standby is promoted at failover. Since the availability > of the database service decreases very much, I don't want to use that. I continue to think that wait-forever is the most sensible option. If you want all of your data on the disks of two machines before the commit is ack'd, I think you probably want that all the time. The second scenario you mentioned ("when the standby is promoted at failover") is quite easy to handle. If you don't want synchronous replication after a standby promotion, then configure the master to do synchronous replication and the slave not to do synchronous replication. Similarly, if you've got an existing machine that is not doing synchronous replication and you want to start, fire up the standby in asynchronous mode and switch to synchronous replication after it has fully caught up. It seems to me that we're bent on providing a service that does synchronous replication except when it first starts up or when the timeout expires or when the phase of the moon is waxing gibbons, and I don't get the point of that. If I ask for synchronous replication, I want it to be synchronous until I explicitly turn it off. Otherwise, when I fail over, how do I know if I've got all my transactions, or not? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
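[Robert's suggestion above amounts to configuring the two roles asymmetrically, so that the promoted standby does not itself wait for a peer after failover; an illustrative sketch:]

    # master's postgresql.conf: commits wait for standby1
    synchronous_standby_names = 'standby1'

    # standby1's postgresql.conf: only takes effect if it is promoted;
    # leave it empty so the new master does not wait after failover
    synchronous_standby_names = ''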
On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> The patch sets "*" as the default, so all standbys are synchronous by >> default. >> >> Would you prefer it if it was blank, meaning no standbys are >> synchronous, by default? > > I think * is a reasonable default. > Actually i would prefer to have standbys asynchronous by default... though is true that there will be no waits until i set synchronous_replication to on... 1) it could be confusing to see a SYNC standby in pg_stat_replication by default when i wanted all of them to be async, 2) also * will give priority 1 to all standbys so it doesn't seem like a very useful out-of-the-box configuration, better to make the dba to write the standby names in the order they want -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Fri, Mar 4, 2011 at 4:18 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> >>> The patch sets "*" as the default, so all standbys are synchronous by >>> default. >>> >>> Would you prefer it if it was blank, meaning no standbys are >>> synchronous, by default? >> >> I think * is a reasonable default. >> > > Actually i would prefer to have standbys asynchronous by default... > though is true that there will be no waits until i set > synchronous_replication to on... 1) it could be confusing to see a > SYNC standby in pg_stat_replication by default when i wanted all of > them to be async, 2) also * will give priority 1 to all standbys so it > doesn't seem like a very useful out-of-the-box configuration, better > to make the dba to write the standby names in the order they want Mmm, good points. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2011-03-04 22:18, Jaime Casanova wrote: > On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas<robertmhaas@gmail.com> wrote: >>> The patch sets "*" as the default, so all standbys are synchronous by >>> default. >>> >>> Would you prefer it if it was blank, meaning no standbys are >>> synchronous, by default? >> I think * is a reasonable default. >> > Actually i would prefer to have standbys asynchronous by default... > though is true that there will be no waits until i set > synchronous_replication to on... 1) it could be confusing to see a > SYNC standby in pg_stat_replication by default when i wanted all of > them to be async, I see no problem with * for synchronous_standby names, such that *if* synchronous_replication = on, then all standbys are sync. Also for the beginning experimenter with sync rep: what would you expect after only turning 'synchronous_replication' = on? ISTM better than : you need to change two parameters from their default to get a replica in sync mode. > 2) also * will give priority 1 to all standbys so it > doesn't seem like a very useful out-of-the-box configuration, better > to make the dba to write the standby names in the order they want > As somebody with a usecase for two hardware-wise equal sync replicas for the same master (and a single async replica), the whole ordering of sync standbys is too much feature anyway, since it will cause unneccesary 'which is the sync replica' switching. Besides that, letting all syncs have the same priority sounds like the only thing the server can do, if the dba has not specified it explicitly. I would see it as improvement if order in standby_names doesn't mean priority, and that priority could be specified with another parameter (and default: all sync priorities the same) regards, Yeb Havinga
On Sat, 2011-03-05 at 05:04 +0900, Fujii Masao wrote: > On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> SIGTERM can be sent by pg_terminate_backend(). So we should check > >> whether shutdown is requested before emitting WARNING and closing > >> the connection. If it's not requested yet, I think that it's safe to return the > >> success indication to the client. > > > > I'm not sure if that matters. Nobody apart from the postmaster knows > > about a shutdown. All the other processes know is that they received > > SIGTERM, which as you say could have been a specific user action aimed > > at an individual process. > > > > We need a way to end the wait state explicitly, so it seems easier to > > make SIGTERM the initiating action, no matter how it is received. > > > > The alternative is to handle it this way > > 1) set something in shared memory > > 2) set latch of all backends > > 3) have the backends read shared memory and then end the wait > > > > Who would do (1) and (2)? Not the backend, its sleeping, not the > > postmaster its shm, nor a WALSender cos it might not be there. > > > > Seems like a lot of effort to avoid SIGTERM. Do we have a good reason > > why we need that? Might it introduce other issues? > > On the second thought... > > I was totally wrong. Preventing the backend from returning the commit > when shutdown is requested doesn't help to avoid the data loss at all. > Without shutdown, the following simple scenario can cause data loss. > > 1. Replication connection is closed because of network outage. > 2. Though replication has not been completed, the waiting backend is > released since the timeout expires. Then it returns the success to > the client. > 3. The primary crashes, and then the clusterware promotes the standby > which doesn't have the latest change on the primary to new primary. > Data loss happens! Yes, that can happen. As people will no doubt observe, this seems to be an argument for wait-forever. What we actually need is a wait that lasts longer than it takes for us to decide to failover, if the standby is actually up and this is some kind of split brain situation. That way the clients are still waiting when failover occurs. WAL is missing, but since we didn't acknowledge the client we are OK to treat that situation as if it were an abort. > In the first place, there are two kinds of data loss: > > (A) Physical data loss > This is the case where we can never retrieve the committed data > physically. For example, if the storage of the standalone server gets > corrupted, we would lose some data forever. To avoid this type of > data loss, we would have to choose the "wait-forever" behavior. But > as I said in upthread, we can decrease the risk of this data loss to > a certain extent by spending much money on the storage. So, if that > cost is less than the cost which we have to pay when down-time > happens, we don't need to choose the "wait-forever" option. > > (B) Logical data loss > This is the case where we think wrongly that the committed data > has been lost while we can actually retrieve it physically. For example, > in the above three-steps scenario, we can read all the committed data > from two servers physically even after failover. But since the client > attempts to read data only from new primary, some data looks lost to > the client. The "wait-forever" behavior can help also to avoid this type > of data loss. And, another way is to STONITH the standby before the > timeout releases any waiting backend.
If so, we can completely prevent > the outdated standby from being brought up, and can avoid logical data > loss. According to my quick research, in DRBD, the "dopd (DRBD > outdate-peer daemon)" plays that role. > > What I'd like to avoid is (B). Though (A) is more serious problem than (B), > we already have some techniques to decrease the risk of (A). But not > (B), I think. > > The "wait-forever" might be a straightforward approach against (B). But > this option prevents transactions from running not only when the > synchronous standby goes away, but also when the primary is invoked > first or when the standby is promoted at failover. Since the availability > of the database service decreases very much, I don't want to use that. > > Keeping transactions waiting in the latter two cases would be required > to avoid (A), but not (B). So I think that we can relax the "wait-forever" > option so that it allows not-replicated transactions to complete only in > those cases. IOW, when we initially start the primary, the backends > don't wait at all for new standby to connect. And, while new primary is > running alone after failover, the backends don't wait at all, too. Only > when replication connection is closed while streaming WAL to sync > standby, the backends wait until new sync standby has connected and > replication has been completed. Even in this case, if we want to > improve the service availability, we have only to make something like > dopd to STONITH the outdated standby, and then request the primary > to release the waiting backends. So I think that the interface to > request that release should be implemented. > > Fortunately, that partial "wait-forever" behavior has already been > implemented in Simon's patch with the client timeout = 0 (disable). > If he implements the interface to release the waiting backends, > I'm OK with his design about when to release the backends for 9.1 > (unless I'm missing something). Almost-working patch attached for the above feature. Time to stop for the day. Patch against current repo version. Current repo version attached here also (v20), which includes all fixes to all known technical issues, major polishing etc.. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Almost-working patch attached for the above feature. Time to stop for > the day. Patch against current repo version. > > Current repo version attached here also (v20), which includes all fixes > to all known technical issues, major polishing etc.. Thanks for the patch. Now the code about the wait list looks very simpler than before! Here are the comments: @@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS) <snip> + if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING) + { + sync_priority[i] = walsnd->sync_standby_priority; This always reports the priority of walsender in CATCHUP state as 0. I don't think that priority needs to be reported as 0. When new standby which has the same priority as current sync standby connects, that new standby can switch to new sync one even though current one is still running. This happens when the index of WalSnd slot which new standby uses is ahead of that which current one uses. People don't expect such an unexpected switchover, I think. + /* + * Assume the queue is ordered by LSN + */ + if (XLByteLT(walsndctl->lsn, proc->waitLSN)) + return numprocs; The code to ensure the assumption needs to be added. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sat, 2011-03-05 at 16:13 +0900, Fujii Masao wrote: > On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Almost-working patch attached for the above feature. Time to stop for > > the day. Patch against current repo version. > > > > Current repo version attached here also (v20), which includes all fixes > > to all known technical issues, major polishing etc.. > > Thanks for the patch. Now the code about the wait list looks very > simpler than before! Here are the comments: > > @@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS) > <snip> > + if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING) > + { > + sync_priority[i] = walsnd->sync_standby_priority; > > This always reports the priority of walsender in CATCHUP state as 0. > I don't think that priority needs to be reported as 0. Cosmetic change. We can do this, yes. > When new standby which has the same priority as current sync standby > connects, that new standby can switch to new sync one even though > current one is still running. This happens when the index of WalSnd slot > which new standby uses is ahead of that which current one uses. People > don't expect such an unexpected switchover, I think. It is documented that the selection of standby from a set of similar priorities is indeterminate. Users don't like it, they can change it. > + /* > + * Assume the queue is ordered by LSN > + */ > + if (XLByteLT(walsndctl->lsn, proc->waitLSN)) > + return numprocs; > > The code to ensure the assumption needs to be added. Yes, just need to add the code for traversing list backwards. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Yes, that can happen. As people will no doubt observe, this seems to be > an argument for wait-forever. What we actually need is a wait that lasts > longer than it takes for us to decide to failover, if the standby is > actually up and this is some kind of split brain situation. That way the > clients are still waiting when failover occurs. WAL is missing, but > since we didn't acknowledge the client we are OK to treat that situation > as if it were an abort. Oracle Data Guard in the maximum availability mode behaves that way? I'm sure that you are implementing something like the maximum availability mode rather than the maximum protection one. So I'd like to know how the data loss situation I described can be avoided in the maximum availability mode. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sat, 2011-03-05 at 11:04 +0000, Simon Riggs wrote: > > + /* > > + * Assume the queue is ordered by LSN > > + */ > > + if (XLByteLT(walsndctl->lsn, proc->waitLSN)) > > + return numprocs; > > > > The code to ensure the assumption needs to be added. > > Yes, just need to add the code for traversing list backwards. I've added code to shmqueue.c to allow this. New version pushed. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
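[Editorial note: the rule being discussed is simply that a committing backend inserts itself into the wait queue starting from the tail, so the queue stays ordered by commit LSN and the walsender can stop waking waiters at the first entry whose LSN it has not yet reached. The following is only a self-contained toy sketch of that insertion rule in C, not the actual shmqueue.c change; the names and types are simplified stand-ins.]

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;            /* simplified stand-in for the real type */

typedef struct WaitElem
{
    struct WaitElem *prev;
    struct WaitElem *next;
    XLogRecPtr       waitLSN;           /* commit LSN this backend waits for */
} WaitElem;

typedef struct WaitQueue
{
    WaitElem head;                      /* circular doubly-linked list, like SHM_QUEUE */
} WaitQueue;

static void
queue_init(WaitQueue *q)
{
    q->head.prev = q->head.next = &q->head;
    q->head.waitLSN = 0;
}

/*
 * Insert starting from the tail: a fresh commit normally has the highest
 * LSN so far, so the scan usually stops immediately and the queue stays
 * ordered by LSN, which the release code relies on.
 */
static void
queue_insert_ordered(WaitQueue *q, WaitElem *e)
{
    WaitElem *pos = q->head.prev;       /* start at the tail */

    while (pos != &q->head && pos->waitLSN > e->waitLSN)
        pos = pos->prev;

    e->prev = pos;
    e->next = pos->next;
    pos->next->prev = e;
    pos->next = e;
}

int main(void)
{
    WaitQueue q;
    WaitElem  a = {0, 0, 300}, b = {0, 0, 100}, c = {0, 0, 200};

    queue_init(&q);
    queue_insert_ordered(&q, &a);
    queue_insert_ordered(&q, &b);
    queue_insert_ordered(&q, &c);

    for (WaitElem *e = q.head.next; e != &q.head; e = e->next)
        printf("%llu\n", (unsigned long long) e->waitLSN);  /* prints 100 200 300 */
    return 0;
}

Since commits normally arrive in roughly increasing LSN order, the backwards scan usually inspects only the tail element before finding its place.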
On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > It is documented that the selection of standby from a set of similar > priorities is indeterminate. Users don't like it, they can change it. That doesn't seem like a good argument to *change* the synchronous standby once it's already set. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote: > On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > It is documented that the selection of standby from a set of similar > > priorities is indeterminate. Users don't like it, they can change it. > > That doesn't seem like a good argument to *change* the synchronous > standby once it's already set. If the order is arbitrary, why does it matter if it changes? The user has the power to specify a sequence, yet they have not done so. They are told the results are indeterminate, which is accurate. I can add the words "and may change as new standbys connect" if that helps. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote: >> On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > It is documented that the selection of standby from a set of similar >> > priorities is indeterminate. Users don't like it, they can change it. >> >> That doesn't seem like a good argument to *change* the synchronous >> standby once it's already set. > > If the order is arbitrary, why does it matter if it changes? > > The user has the power to specify a sequence, yet they have not done so. > They are told the results are indeterminate, which is accurate. I can > add the words "and may change as new standbys connect" if that helps. I just don't think that's very useful behavior. Suppose I have a master and two standbys. Both are local (or both are remote with equally good connectivity). When one of the standbys goes down, there will be a hiccup (i.e. transactions will block trying to commit) until that guy falls off and the other one takes over. Now, when he comes back up again, I don't want the synchronous standby to change again; that seems like a recipe for another hiccup. I think "who the current synchronous standby is" should act as a tiebreak. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > If the order is arbitrary, why does it matter if it changes?
> >
> > The user has the power to specify a sequence, yet they have not done so.
> > They are told the results are indeterminate, which is accurate. I can
> > add the words "and may change as new standbys connect" if that helps.
>
> I just don't think that's very useful behavior. Suppose I have a
> master and two standbys. Both are local (or both are remote with
> equally good connectivity). When one of the standbys goes down, there
> will be a hiccup (i.e. transactions will block trying to commit) until
> that guy falls off and the other one takes over. Now, when he comes
> back up again, I don't want the synchronous standby to change again;
> that seems like a recipe for another hiccup. I think "who the current
> synchronous standby is" should act as a tiebreak.

+1

TLDR part: The first one might be noticed by users because it takes tens of seconds before the sync switch. The second hiccup is hardly noticable. However limiting the # switches of sync standby to the absolute minimum is also good if e.g. (if there was a hook for it) cluster middleware is notified of the sync replica change. That might either introduce a race condition or be even completely unreliable if the notify is sent asynchronous, or it might introduce a longer lag if the master waits for confirmation of the sync replica change message. At that point sync replica changes become more expensive than they are currently.

regards,
Yeb Havinga
On Sat, 2011-03-05 at 14:44 +0100, Yeb Havinga wrote: > On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas@gmail.com> > wrote: > On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs > <simon@2ndquadrant.com> wrote: > > If the order is arbitrary, why does it matter if it changes? > > > > The user has the power to specify a sequence, yet they have > not done so. > > They are told the results are indeterminate, which is > accurate. I can > > add the words "and may change as new standbys connect" if > that helps. > > > I just don't think that's very useful behavior. Suppose I > have a > master and two standbys. Both are local (or both are remote > with > equally good connectivity). When one of the standbys goes > down, there > will be a hiccup (i.e. transactions will block trying to > commit) until > that guy falls off and the other one takes over. Now, when he > comes > back up again, I don't want the synchronous standby to change > again; > that seems like a recipe for another hiccup. I think "who the > current > synchronous standby is" should act as a tiebreak. > > > +1 > > TLDR part: > > The first one might be noticed by users because it takes tens of > seconds before the sync switch. The second hiccup is hardly noticable. > However limiting the # switches of sync standby to the absolute > minimum is also good if e.g. (if there was a hook for it) cluster > middleware is notified of the sync replica change. That might either > introduce a race condition or be even completely unreliable if the > notify is sent asynchronous, or it might introduce a longer lag if the > master waits for confirmation of the sync replica change message. At > that point sync replica changes become more expensive than they are > currently. I'm not in favour. If the user has a preferred order, they can specify it. If there is no preferred order, how will we maintain that order? What are the rules for maintaining this arbitrary order? No, this is not something we need prior to commit, if ever. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote: > On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Yes, that can happen. As people will no doubt observe, this seems to be > > an argument for wait-forever. What we actually need is a wait that lasts > > longer than it takes for us to decide to failover, if the standby is > > actually up and this is some kind of split brain situation. That way the > > clients are still waiting when failover occurs. WAL is missing, but > > since we didn't acknowledge the client we are OK to treat that situation > > as if it were an abort. > > Oracle Data Guard in the maximum availability mode behaves that way? > > I'm sure that you are implementing something like the maximum availability > mode rather than the maximum protection one. So I'd like to know how > the data loss situation I described can be avoided in the maximum availability > mode. This is important, so I am taking time to formulate a full reply. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > I've added code to shmqueue.c to allow this. > > New version pushed. New comments; It looks odd to report the sync_state of walsender in BACKUP state as ASYNC. +SyncRepCleanupAtProcExit(int code, Datum arg) +{ + if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links))) + { + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); + SHMQueueDelete(&(MyProc->syncrep_links)); + LWLockRelease(SyncRepLock); + } + + if (MyProc != NULL) + DisownLatch(&MyProc->waitLatch); Can MyProc really be NULL here? If yes, "MyProc != NULL" should be checked before seeing MyProc->syncrep_links. Even though postmaster dies, the waiting backend keeps waiting until the timeout expires. Instead, the backends should periodically check whether postmaster is alive, and then they should exit immediately if it's not alive, as well as other process does? If the timeout is disabled, such backends would get stuck infinitely. Though I commented about the issue related to shutdown, that was pointless. So change of ProcessInterrupts is not required unless we find the need again. Sorry for the noise.. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > I'm not in favour. > > If the user has a preferred order, they can specify it. If there is no > preferred order, how will we maintain that order? > > What are the rules for maintaining this arbitrary order? Probably what Robert, Yeb and I think is to leave the current sync standby in sync mode until either its connection is closed or higher priority standby connects. No complicated rule is required. To do that, how about tracking which standby is currently in sync mode? Each walsender checks whether its priority is higher than that of current sync one, and if yes, it takes over. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mar 5, 2011, at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I'm not in favour. >> >> If the user has a preferred order, they can specify it. If there is no >> preferred order, how will we maintain that order? >> >> What are the rules for maintaining this arbitrary order? > > Probably what Robert, Yeb and I think is to leave the current > sync standby in sync mode until either its connection is closed > or higher priority standby connects. No complicated rule is > required. > > To do that, how about tracking which standby is currently in > sync mode? Each walsender checks whether its priority is > higher than that of current sync one, and if yes, it takes over. That is precisely what I would expect to happen, and IMHO quite useful. ...Robert
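[Editorial note: for concreteness, here is a minimal, self-contained sketch in C of the takeover rule being proposed, using made-up structures rather than the real WalSnd shared-memory code. A walsender only claims the sync slot when there is no current sync standby or when its own priority is strictly better, so an equal-priority newcomer never displaces the established sync standby.]

#include <stdio.h>

/* toy model of the shared walsender slots; the real code would keep this
 * in shared memory protected by SyncRepLock */
typedef struct WalSndSlot
{
    int pid;                        /* 0 means unused slot */
    int sync_standby_priority;      /* 1 = highest priority, 0 = not a candidate */
    int is_sync;                    /* currently the synchronous standby? */
} WalSndSlot;

/*
 * Called by a walsender after it has computed its own priority from
 * synchronous_standby_names.  It takes over only when no sync standby
 * exists or when it has a strictly better (lower) priority number.
 */
static void
maybe_become_sync(WalSndSlot *slots, int nslots, int my_index)
{
    WalSndSlot *me = &slots[my_index];
    int cur = -1;

    if (me->sync_standby_priority == 0)
        return;                     /* not listed as a sync candidate */

    for (int i = 0; i < nslots; i++)
        if (slots[i].pid != 0 && slots[i].is_sync)
            cur = i;

    if (cur == -1 || me->sync_standby_priority < slots[cur].sync_standby_priority)
    {
        if (cur != -1)
            slots[cur].is_sync = 0;
        me->is_sync = 1;
    }
}

int main(void)
{
    WalSndSlot slots[3] = {
        {101, 2, 1},                /* current sync standby, priority 2 */
        {102, 2, 0},                /* same priority: must not take over */
        {103, 1, 0},                /* better priority: should take over */
    };

    maybe_become_sync(slots, 3, 1);
    maybe_become_sync(slots, 3, 2);

    for (int i = 0; i < 3; i++)
        printf("pid %d priority %d sync %d\n",
               slots[i].pid, slots[i].sync_standby_priority, slots[i].is_sync);
    return 0;
}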
On 05/03/2011 11:18, "Fujii Masao" <masao.fujii@gmail.com> wrote: > On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > I'm not in favour. > > > > If the user has a preferred order, they can specify it. If there is no > > preferred order, how will we maintain that order? > > > > What are the rules for maintaining this arbitrary order? > > Probably what Robert, Yeb and I think is to leave the current > sync standby in sync mode until either its connection is closed > or higher priority standby connects. No complicated rule is > required. Isn't it better to remove the code to manage * in synchronous_standby_names? Once we do that there is no chance of having 2 standbys with the same priority. After all, most of the time the dba will need to change the * for a real list of names anyway. At least IMHO. -- Jaime Casanova www.2ndQuadrant.com
On Sun, 2011-03-06 at 01:17 +0900, Fujii Masao wrote: > On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > I'm not in favour. > > > > If the user has a preferred order, they can specify it. If there is no > > preferred order, how will we maintain that order? > > > > What are the rules for maintaining this arbitrary order? > > Probably what Robert, Yeb and I think is to leave the current > sync standby in sync mode until either its connection is closed > or higher priority standby connects. No complicated rule is > required. No, it is complex. The code is intentionally stateless, so unless you have a rule you cannot work out which one is sync standby. It is much more important to have robust takeover behaviour. Changing this will require rethinking how that takeover works. And I'm not doing that for something that is documented as "indeterminate". If you care about the sequence then set the supplied parameter, which I have gone to some trouble to provide. > To do that, how about tracking which standby is currently in > sync mode? Each walsender checks whether its priority is > higher than that of current sync one, and if yes, it takes over. > > Regards, > -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 11:42 -0500, Jaime Casanova wrote: > El 05/03/2011 11:18, "Fujii Masao" <masao.fujii@gmail.com> escribió: > > > > On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> > wrote: > > > I'm not in favour. > > > > > > If the user has a preferred order, they can specify it. If there > is no > > > preferred order, how will we maintain that order? > > > > > > What are the rules for maintaining this arbitrary order? > > > > Probably what Robert, Yeb and I think is to leave the current > > sync standby in sync mode until either its connection is closed > > or higher priority standby connects. No complicated rule is > > required. > > > > It's not better to remove the code to manage * in > synchronous_standby_names? Once we do that there is no chance of > having 2 standbys with the same priority. Yes, there is, because we don't do duplicate name checking. I've changed the default so it is no longer "*" by default, to avoid complaints. > After all, most of the times the dba will need to change the * for a > real list of names anyway. At least in IMHO > > -- > Jaime Casanova www.2ndQuadrant.com > -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote: > On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Yes, that can happen. As people will no doubt observe, this seems to be > > an argument for wait-forever. What we actually need is a wait that lasts > > longer than it takes for us to decide to failover, if the standby is > > actually up and this is some kind of split brain situation. That way the > > clients are still waiting when failover occurs. WAL is missing, but > > since we didn't acknowledge the client we are OK to treat that situation > > as if it were an abort. > > Oracle Data Guard in the maximum availability mode behaves that way? > > I'm sure that you are implementing something like the maximum availability > mode rather than the maximum protection one. So I'd like to know how > the data loss situation I described can be avoided in the maximum availability > mode. It can't. (Oracle or otherwise...) Once we begin waiting for sync rep, if the transaction or backend ends then other backends will be able to see the changed data. The only way to prevent that is to shutdown the database to ensure that no readers or writers have access to that. Oracle's protection mechanism is to shutdown the primary if there is no sync standby available. Maximum Protection. Any other mode must therefore be less than maximum protection, according to Oracle, and me. "Available" here means one that has not timed out, via parameter. Shutting down the main server is cool, as long as you failover to one of the standbys. If there aren't any standbys, or you don't have a mechanism for switching quickly, you have availability problems. What shutting down the server doesn't do is keep the data safe for transactions that were in their commit-wait phase when the disconnect occurs. That data exists, yet will not have been transferred to the standby. From now, I also say we should wait forever. It is the safest mode and I want no argument about whether sync rep is safe or not. We can introduce a more relaxed mode later with high availability for the primary. That is possible and in some cases desirable. Now, when we lose last sync standby we have three choices: 1. reconnect the standby, or wait for a potential standby to catchup 2. immediate shutdown of master and failover to one of the standbys 3. reclassify an async standby as a sync standby More than likely we would attempt to do (1) for a while, then do (2) This means that when we startup the primary will freeze for a while until the sync standbys connect. Which is OK, since they try to reconnect every 5 seconds and we don't plan on shutting down the primary much anyway. I've removed the timeout parameter, plus if we begin waiting we wait until released, forever if that's how long it takes. The recommendation to use more than one standby remains. Fast shutdown will wake backends from their latch and there isn't any changed interrupt behaviour any more. synchronous_standby_names = '*' is no longer the default On a positive note this is one less parameter and will improve performance as well. All above changes made. Ready to commit, barring concrete objections to important behaviour. I will do one final check tomorrow evening then commit. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Attachment
On Sun, 2011-03-06 at 00:42 +0900, Fujii Masao wrote: > On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > I've added code to shmqueue.c to allow this. > > > > New version pushed. > > New comments; None of the requested changes are in v21, as yet. > It looks odd to report the sync_state of walsender in BACKUP > state as ASYNC. Cool. > +SyncRepCleanupAtProcExit(int code, Datum arg) > +{ > + if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links))) > + { > + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > + SHMQueueDelete(&(MyProc->syncrep_links)); > + LWLockRelease(SyncRepLock); > + } > + > + if (MyProc != NULL) > + DisownLatch(&MyProc->waitLatch); > > Can MyProc really be NULL here? If yes, "MyProc != NULL" should be > checked before seeing MyProc->syncrep_links. OK > Even though postmaster dies, the waiting backend keeps waiting until > the timeout expires. Instead, the backends should periodically check > whether postmaster is alive, and then they should exit immediately > if it's not alive, as well as other process does? If the timeout is > disabled, such backends would get stuck infinitely. Will wake them every 60 seconds > Though I commented about the issue related to shutdown, that was > pointless. So change of ProcessInterrupts is not required unless we > find the need again. Sorry for the noise.. Yep, all gone now. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > New comments; Another one; + long timeout = SyncRepGetWaitTimeout(); <snip> + else if (timeout > 0 && + TimestampDifferenceExceeds(wait_start, now, timeout)) + { The third argument of TimestampDifferenceExceeds is expressed in milliseconds. But you wrongly give that the microseconds. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On a positive note this is one less parameter and will improve
> performance as well.
>
> All above changes made.
>
> Ready to commit, barring concrete objections to important behaviour.
>
> I will do one final check tomorrow evening then commit.

Will retest with new version this evening. Also curious about the performance improvement, since v17 seems to be the top scorer in that department.

regards,
Yeb Havinga
On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Even though postmaster dies, the waiting backend keeps waiting until >> the timeout expires. Instead, the backends should periodically check >> whether postmaster is alive, and then they should exit immediately >> if it's not alive, as well as other process does? If the timeout is >> disabled, such backends would get stuck infinitely. > > Will wake them every 60 seconds I don't really see why sync rep should be responsible for solving this problem, which is an issue in many other situations as well, only for itself. In fact I think I'd prefer that it didn't, and that we wait for a more general solution that will actually fix this problem for real. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2011-03-05 18:25, Yeb Havinga wrote:
> On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On a positive note this is one less parameter and will improve
> > performance as well.
> >
> > All above changes made.
> >
> > Ready to commit, barring concrete objections to important behaviour.
> >
> > I will do one final check tomorrow evening then commit.
>
> Will retest with new version this evening. Also curious about the performance improvement, since v17 seems to be the top scorer in that department.

Summary of preliminary testing:
1) it is confusing to show messages/ contents of stat_replication that hints at syncrep, when synchronous_replication is on.
2) should guc settings for synchronous_replication be changed so it can only be set in the config file, and hence only change with reload_conf()?
3) speed is comparable to v17 :-)

So the biggest change is perhaps that you cannot start 'working' immediately after an initdb with synchronous_replication=on, without a connected standby; I needed to create a role for the repuser to make a backup, but the master halted. The initial bootstrapping has to be done with synchronous_replication = off. So I did, made a backup, and started the standbys while still in non-synchronous mode. What followed was confusing:

LOG: 00000: standby "standby2" is now the synchronous standby with priority 2

postgres=# show synchronous_replication ; show synchronous_standby_names; select application_name,state,sync_state from pg_stat_replication;
 synchronous_replication
-------------------------
 off
(1 row)

 synchronous_standby_names
----------------------------
 standby1,standby2,standby3
(1 row)

 application_name |   state   | sync_state
------------------+-----------+------------
 standby2         | STREAMING | SYNC
 asyncone         | STREAMING | ASYNC
(2 rows)

Is it really sync? pgbench test got 1464 tps.. seems a bit high.

psql postgres, set synchronous_replication = on;
- no errors, and show after disconnect showed this parameter was still on. My guess: we have syncrep! A restart or reload config was not necessary.
pgbench test got 1460 tps.

pg_reload_conf(); with syncrep = on in postgresql.conf
pgbench test got 819 tps
So now this is synchronous.

Disabled the asynchronous standby
pgbench test got 920 tps.

I also got a first first > 1000 tps score :-) (yeah you have to believe me there really was a sync standby server)

$ pgbench -c 10 -M prepared -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 50
query mode: prepared
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 30863
tps = 1027.493807 (including connections establishing)
tps = 1028.183618 (excluding connections establishing)

regards,
Yeb Havinga
On 2011-03-05 21:11, Yeb Havinga wrote: > Summary of preliminary testing: > 1) it is confusing to show messages/ contents of stat_replication that hints at syncrep, when synchronous_replication is on. s/on/off/ Also forgot to mention these tests are against the latest v21 syncrep patch. Y
On Sat, Mar 5, 2011 at 3:11 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > > Summary of preliminary testing: > 1) it is confusing to show messages/ contents of stat_replication that hints > at syncrep, when synchronous_replication is on. [for the record, Yeb explains he means OFF, not on...] the thing is that once you put a server name in synchronous_standby_names that standby is declared to be synchronous and because synchronous_replication can change at any time for any given backend we need to know the priority of any declared sync standby. if you want pure async standbys remove their names from synchronous_standby_names > 2) should guc settings for synchronous_replication be changed so it can only > be set in the config file, and hence only change with reload_conf()? no. the thing is that we can determine for which data we want to pay the price of synchrony and for which data we don't > 3) speed is comparable to v17 :-) > yeah... it's a lot better than before, good work Simon :) -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> Even though postmaster dies, the waiting backend keeps waiting until >>> the timeout expires. Instead, the backends should periodically check >>> whether postmaster is alive, and then they should exit immediately >>> if it's not alive, as well as other process does? If the timeout is >>> disabled, such backends would get stuck infinitely. >> >> Will wake them every 60 seconds > > I don't really see why sync rep should be responsible for solving this > problem, which is an issue in many other situations as well, only for > itself. In fact I think I'd prefer that it didn't, and that we wait > for a more general solution that will actually fix this problem for > real. I agree if such a general solution will be committed together with sync rep. Otherwise, because of sync rep, the backend can easily get stuck *infinitely*. When postmaster is not alive, all the existing walsenders exit immediately and no new walsender can appear. So there is no way to release the waiting backend. I think that some solutions for this issue which is likely to happen are required. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Mar 6, 2011 at 1:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote: >> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> > Yes, that can happen. As people will no doubt observe, this seems to be >> > an argument for wait-forever. What we actually need is a wait that lasts >> > longer than it takes for us to decide to failover, if the standby is >> > actually up and this is some kind of split brain situation. That way the >> > clients are still waiting when failover occurs. WAL is missing, but >> > since we didn't acknowledge the client we are OK to treat that situation >> > as if it were an abort. >> >> Oracle Data Guard in the maximum availability mode behaves that way? >> >> I'm sure that you are implementing something like the maximum availability >> mode rather than the maximum protection one. So I'd like to know how >> the data loss situation I described can be avoided in the maximum availability >> mode. > > It can't. (Oracle or otherwise...) > > Once we begin waiting for sync rep, if the transaction or backend ends > then other backends will be able to see the changed data. The only way > to prevent that is to shutdown the database to ensure that no readers or > writers have access to that. > > Oracle's protection mechanism is to shutdown the primary if there is no > sync standby available. Maximum Protection. Any other mode must > therefore be less than maximum protection, according to Oracle, and me. > "Available" here means one that has not timed out, via parameter. > > Shutting down the main server is cool, as long as you failover to one of > the standbys. If there aren't any standbys, or you don't have a > mechanism for switching quickly, you have availability problems. > > What shutting down the server doesn't do is keep the data safe for > transactions that were in their commit-wait phase when the disconnect > occurs. That data exists, yet will not have been transferred to the > standby. > > >From now, I also say we should wait forever. It is the safest mode and I > want no argument about whether sync rep is safe or not. We can introduce > a more relaxed mode later with high availability for the primary. That > is possible and in some cases desirable. > > Now, when we lose last sync standby we have three choices: > > 1. reconnect the standby, or wait for a potential standby to catchup > > 2. immediate shutdown of master and failover to one of the standbys > > 3. reclassify an async standby as a sync standby > > More than likely we would attempt to do (1) for a while, then do (2) > > This means that when we startup the primary will freeze for a while > until the sync standbys connect. Which is OK, since they try to > reconnect every 5 seconds and we don't plan on shutting down the primary > much anyway. > > I've removed the timeout parameter, plus if we begin waiting we wait > until released, forever if that's how long it takes. > > The recommendation to use more than one standby remains. > > Fast shutdown will wake backends from their latch and there isn't any > changed interrupt behaviour any more. > > synchronous_standby_names = '*' is no longer the default > > On a positive note this is one less parameter and will improve > performance as well. > > All above changes made. > > Ready to commit, barring concrete objections to important behaviour. > > I will do one final check tomorrow evening then commit. I agree with this change. 
One comment; what about introducing built-in function to wake up all the waiting backends? When replication connection is closed, if we STONITH the standby, we can safely (for not physical data loss but logical one) switch the primary to standalone mode. But there is no way to wake up the waiting backends for now. Setting synchronous_replication to OFF and reloading the configuration file doesn't affect the existing waiting backends. The attached patch introduces the "pg_wakeup_all_waiters" (better name?) function which wakes up all the backends on the queue. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
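[Editorial note: for illustration, usage of the proposed function would presumably be just the following; the name comes from the attached patch and is not an existing built-in, so treat it as a sketch of the interface being discussed.]

-- proposed in the attached patch, not an existing function: release every
-- backend currently waiting for a synchronous replication acknowledgement
SELECT pg_wakeup_all_waiters();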
On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > One comment; what about introducing built-in function to wake up all the > waiting backends? When replication connection is closed, if we STONITH > the standby, we can safely (for not physical data loss but logical one) > switch the primary to standalone mode. But there is no way to wake up > the waiting backends for now. Setting synchronous_replication to OFF > and reloading the configuration file doesn't affect the existing waiting > backends. The attached patch introduces the "pg_wakeup_all_waiters" > (better name?) function which wakes up all the backends on the queue. If unfortunately all connection slots are used by backends waiting for replication, we cannot execute such a function. So it makes more sense to introduce something like "pg_ctl standalone" command? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, 2011-03-06 at 14:27 +0900, Fujii Masao wrote: > On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >>> Even though postmaster dies, the waiting backend keeps waiting until > >>> the timeout expires. Instead, the backends should periodically check > >>> whether postmaster is alive, and then they should exit immediately > >>> if it's not alive, as well as other process does? If the timeout is > >>> disabled, such backends would get stuck infinitely. > >> > >> Will wake them every 60 seconds > > > > I don't really see why sync rep should be responsible for solving this > > problem, which is an issue in many other situations as well, only for > > itself. In fact I think I'd prefer that it didn't, and that we wait > > for a more general solution that will actually fix this problem for > > real. > > I agree if such a general solution will be committed together with sync rep. > Otherwise, because of sync rep, the backend can easily get stuck *infinitely*. > When postmaster is not alive, all the existing walsenders exit immediately > and no new walsender can appear. So there is no way to release the > waiting backend. I think that some solutions for this issue which is likely to > happen are required. Completely agree. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote: > On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > > One comment; what about introducing built-in function to wake up all the > > waiting backends? When replication connection is closed, if we STONITH > > the standby, we can safely (for not physical data loss but logical one) > > switch the primary to standalone mode. But there is no way to wake up > > the waiting backends for now. Setting synchronous_replication to OFF > > and reloading the configuration file doesn't affect the existing waiting > > backends. The attached patch introduces the "pg_wakeup_all_waiters" > > (better name?) function which wakes up all the backends on the queue. > > If unfortunately all connection slots are used by backends waiting for > replication, we cannot execute such a function. So it makes more sense > to introduce something like "pg_ctl standalone" command? Well, there is one way to end the wait: shutdown, or use pg_terminate_backend(). If you simply end the wait you will get COMMIT messages. What I would like to do is commit the "safe" patch now. We can then discuss whether it is safe and desirable to relax some aspects of that during beta. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, 2011-03-06 at 01:58 +0900, Fujii Masao wrote: > On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > New comments; > > Another one; > > + long timeout = SyncRepGetWaitTimeout(); > <snip> > + else if (timeout > 0 && > + TimestampDifferenceExceeds(wait_start, now, timeout)) > + { > > The third argument of TimestampDifferenceExceeds is > expressed in milliseconds. But you wrongly give that the > microseconds. Just for the record: that code section is now removed in v21 -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Mar 6, 2011 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote: >> On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > One comment; what about introducing built-in function to wake up all the >> > waiting backends? When replication connection is closed, if we STONITH >> > the standby, we can safely (for not physical data loss but logical one) >> > switch the primary to standalone mode. But there is no way to wake up >> > the waiting backends for now. Setting synchronous_replication to OFF >> > and reloading the configuration file doesn't affect the existing waiting >> > backends. The attached patch introduces the "pg_wakeup_all_waiters" >> > (better name?) function which wakes up all the backends on the queue. >> >> If unfortunately all connection slots are used by backends waiting for >> replication, we cannot execute such a function. So it makes more sense >> to introduce something like "pg_ctl standalone" command? > > Well, there is one way to end the wait: shutdown, or use > pg_terminate_backend(). Immediate shutdown can release the wait. But smart and fast shutdown cannot. Also pg_terminate_backend() cannot. Since a backend is waiting on the latch and InterruptHoldoffCount is not zero, only SetLatch() or SIGQUIT can cause it to end. > If you simply end the wait you will get COMMIT messages. > > What I would like to do is commit the "safe" patch now. We can then > discuss whether it is safe and desirable to relax some aspects of that > during beta. OK if changing some aspects is acceptable during beta. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: > On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> If unfortunately all connection slots are used by backends waiting for >> replication, we cannot execute such a function. So it makes more sense >> to introduce something like "pg_ctl standalone" command? > > If it is only for shutdown, maybe pg_ctl stop -m standalone? It's for not only shutdown but also running the primary in standalone mode. So something like "pg_ctl standalone" is better. For now I think that pg_ctl command is better than built-in function because sometimes we might want to wake waiters up even during shutdown in order to cause shutdown to end. During shutdown, the server doesn't accept any new connection (even from the standby). So, without something like "pg_ctl standalone", there is no way to cause shutdown to end. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 06/03/2011 03:26, "Simon Riggs" <simon@2ndquadrant.com> wrote: > On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote: > > If unfortunately all connection slots are used by backends waiting for > > replication, we cannot execute such a function. So it makes more sense > > to introduce something like "pg_ctl standalone" command? > Well, there is one way to end the wait: shutdown, or use > pg_terminate_backend(). I disconnected all standbys so the master keeps waiting on commit. Then I shut down the master with immediate mode and got a crash. I wasn't able to trace that, though. -- Jaime Casanova www.2ndQuadrant.com
On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: >> On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> >>> If unfortunately all connection slots are used by backends waiting for >>> replication, we cannot execute such a function. So it makes more sense >>> to introduce something like "pg_ctl standalone" command? >> >> If it is only for shutdown, maybe pg_ctl stop -m standalone? > > It's for not only shutdown but also running the primary in standalone mode. > So something like "pg_ctl standalone" is better. > > For now I think that pg_ctl command is better than built-in function because > sometimes we might want to wake waiters up even during shutdown in > order to cause shutdown to end. During shutdown, the server doesn't > accept any new connection (even from the standby). So, without something > like "pg_ctl standalone", there is no way to cause shutdown to end. This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replication connections can no longer be accepted, any standbys hung on commit need to close the connection without responding to the COMMIT, per previous discussion. It's completely unreasonable for sync rep to break the shutdown sequence. ...Robert
On Sun, 2011-03-06 at 16:51 +0900, Fujii Masao wrote: > One comment; what about introducing built-in function to wake up all the > waiting backends? When replication connection is closed, if we STONITH > the standby, we can safely (for not physical data loss but logical one) > switch the primary to standalone mode. But there is no way to wake up > the waiting backends for now. Setting synchronous_replication to OFF > and reloading the configuration file doesn't affect the existing waiting > backends. The attached patch introduces the "pg_wakeup_all_waiters" > (better name?) function which wakes up all the backends on the queue. Will apply this as a separate commit. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote: > I also got a first first > 1000 tps score The committed version should be even faster. Would appreciate a retest. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-07 01:37, Simon Riggs wrote: > On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote: > >> I also got a first first> 1000 tps score > The committed version should be even faster. Would appreciate a retest. > pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test dbsize was -s 50, 1Gbit Ethernet 1 async standby tps = 2475.285931 (excluding connections establishing) 2 async standbys tps = 2333.670561 (excluding connections establishing) 1 sync standby tps = 1277.082753 (excluding connections establishing) 1 sync, 1 async standby tps = 1273.317386 (excluding connections establishing) Hard for me to not revert to superlatives right now! :-) regards, Yeb Havinga
On Mon, 2011-03-07 at 14:20 +0100, Yeb Havinga wrote: > On 2011-03-07 01:37, Simon Riggs wrote: > > On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote: > > > >> I also got a first first> 1000 tps score > > The committed version should be even faster. Would appreciate a retest. > > > pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test > dbsize was -s 50, 1Gbit Ethernet > > 1 async standby > tps = 2475.285931 (excluding connections establishing) > > 2 async standbys > tps = 2333.670561 (excluding connections establishing) > > 1 sync standby > tps = 1277.082753 (excluding connections establishing) > > 1 sync, 1 async standby > tps = 1273.317386 (excluding connections establishing) > > Hard for me to not revert to superlatives right now! :-) That looks like good news, thanks. It shows that sync rep is "fairly fast", but it also shows clearly why you'd want to mix sync and async replication within an application. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Mar 7, 2011 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote: >>> On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> >>>> If unfortunately all connection slots are used by backends waiting for >>>> replication, we cannot execute such a function. So it makes more sense >>>> to introduce something like "pg_ctl standalone" command? >>> >>> If it is only for shutdown, maybe pg_ctl stop -m standalone? >> >> It's for not only shutdown but also running the primary in standalone mode. >> So something like "pg_ctl standalone" is better. >> >> For now I think that pg_ctl command is better than built-in function because >> sometimes we might want to wake waiters up even during shutdown in >> order to cause shutdown to end. During shutdown, the server doesn't >> accept any new connection (even from the standby). So, without something >> like "pg_ctl standalone", there is no way to cause shutdown to end. > > This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replicationconnections can no longer be accepted, any standbys hung on commit need to close the connection without respondingto the COMMIT, per previous discussion. It's completely unreasonable for sync rep to break the shutdown sequence. Yeah, let's think about how shutdown should work. I'd like to propose the following. Thought? * Smart shutdown Smart shutdown should wait for all the waiting backends to be acked, and should not cause them to forcibly exit. But this leads shutdown to get stuck infinitely if there is no walsender at that time. To enable them to be acked even in that situation, we need to change postmaster so that it accepts the replication connection even during smart shutdown (until we reach PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser connection to cancel backup during smart shutdown. So I don't think that the idea to accept the replication connection during smart shutdown is so ugly. * Fast shutdown I agree with you about fast shutdown. Fast shutdown should cause all the backends including waiting ones to exit immediately. At that time, the non-acked backend should not return the success, according to the definition of sync rep. So we need to change a backend so that it gets rid of itself from the waiting queue and exits before returning the success, when it receives SIGTERM. This change leads the waiting backends to do the same even when pg_terminate_backend is called. But since they've not been acked yet, it seems to be reasonable to prevent them from returning the COMMIT. Comments? I'll create the patch barring objection. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
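[Editorial note: a toy illustration in C of the fast-shutdown semantics being proposed above, with ordinary POSIX signals standing in for the real latch and SIGTERM machinery. The point it demonstrates is only that when termination arrives before the acknowledgement, the waiting "backend" leaves the queue and never reports a successful COMMIT; it is not the actual patch.]

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

int main(void)
{
    int acked = 0;          /* would be set when the standby's ack arrives */
    int i;

    signal(SIGTERM, handle_sigterm);
    raise(SIGTERM);         /* simulate fast shutdown arriving mid-wait */

    /* pretend to be a backend waiting for the synchronous ack */
    for (i = 0; i < 10 && !acked && !shutdown_requested; i++)
        sleep(1);           /* WaitLatch() in the real code */

    if (shutdown_requested || !acked)
    {
        /* leave the wait queue (SHMQueueDelete in the real code) and abort
         * the command instead of reporting a successful COMMIT */
        fprintf(stderr, "WARNING: terminating wait before replication ack\n");
        return 1;
    }

    printf("COMMIT\n");     /* only if the ack really arrived */
    return 0;
}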
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Yeah, let's think about how shutdown should work. I'd like to propose the > following. Thought? > > * Smart shutdown > Smart shutdown should wait for all the waiting backends to be acked, and > should not cause them to forcibly exit. But this leads shutdown to get stuck > infinitely if there is no walsender at that time. To enable them to be acked > even in that situation, we need to change postmaster so that it accepts the > replication connection even during smart shutdown (until we reach > PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser > connection to cancel backup during smart shutdown. So I don't think that > the idea to accept the replication connection during smart shutdown is so > ugly. > > * Fast shutdown > I agree with you about fast shutdown. Fast shutdown should cause all the > backends including waiting ones to exit immediately. At that time, the > non-acked backend should not return the success, according to the > definition of sync rep. So we need to change a backend so that it gets rid > of itself from the waiting queue and exits before returning the success, > when it receives SIGTERM. This change leads the waiting backends to > do the same even when pg_terminate_backend is called. But since > they've not been acked yet, it seems to be reasonable to prevent them > from returning the COMMIT. The fast shutdown handling seems fine, but why not just handle smart shutdown the same way? I don't really like the idea of allowing replication connections for longer, and the idea that we don't want to keep waiting for a commit ACK once we're past the point where it's possible for one to occur seems to apply generically to any shutdown sequence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > The fast shutdown handling seems fine, but why not just handle smart > shutdown the same way? currently, smart shutdown means no new connections, wait until existing ones close normally. for consistency, it should behave the same for sync rep. +1 for Fujii's proposal -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL
Simon Riggs wrote: > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: > > > postgres=# SELECT application_name, state, sync_priority, sync_state > > FROM pg_stat_replication; > > application_name | state | sync_priority | sync_state > > ------------------+-----------+---------------+------------ > > one | STREAMING | 1 | POTENTIAL > > two | STREAMING | 2 | SYNC > > (2 rows) > > Bug! Thanks. Is there a reason these status are all upper-case? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Mar 9, 2011 at 9:21 PM, Bruce Momjian <bruce@momjian.us> wrote: > Simon Riggs wrote: >> On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: >> >> > postgres=# SELECT application_name, state, sync_priority, sync_state >> > FROM pg_stat_replication; >> > application_name | state | sync_priority | sync_state >> > ------------------+-----------+---------------+------------ >> > one | STREAMING | 1 | POTENTIAL >> > two | STREAMING | 2 | SYNC >> > (2 rows) >> >> Bug! Thanks. > > Is there a reason these status are all upper-case? Not that I know of. However, I think that some more fundamental rethinking of the "state" mechanism may be in order. When Magnus first committed this, it would say CATCHUP whenever you were behind (even if only momentarily) and STREAMING if you were caught up. Simon then changed it so that it says CATCHUP until you catch up the first time, and then STREAMING afterward (even if you fall behind again). Neither behavior seems completely adequate to me. I think we should have a way to know whether we've ever been caught up, and if so when the most recent time was. So you could then say things like "is the most recent time at which the standby was caught up within the last 30 seconds?", which would be a useful thing to monitor, and right now there's no way to do it. There's also a BACKUP state, but I'm not sure it makes sense to lump that in with the others. Some day it might be possible to stream WAL and take a backup at the same time, over the same connection. Maybe that should be a separate column or something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
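[Editorial note: to make that concrete, the check being described might look like the query below; the last_caught_up column is purely hypothetical and does not exist in pg_stat_replication, it only illustrates the kind of monitoring being proposed.]

-- hypothetical column: "last_caught_up" is not real, it stands for
-- "the most recent time this standby was fully caught up"
SELECT application_name,
       now() - last_caught_up < interval '30 seconds' AS recently_caught_up
  FROM pg_stat_replication;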
On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote: > Simon Riggs wrote: > > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: > > > > > postgres=# SELECT application_name, state, sync_priority, sync_state > > > FROM pg_stat_replication; > > > application_name | state | sync_priority | sync_state > > > ------------------+-----------+---------------+------------ > > > one | STREAMING | 1 | POTENTIAL > > > two | streaming | 2 | sync > > > (2 rows) > > > > Bug! Thanks. > > Is there a reason these status are all upper-case? NOT AS FAR AS I KNOW. I'll add it to the list of changes for beta. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Robert Haas <robertmhaas@gmail.com> writes: > was. So you could then say things like "is the most recent time at > which the standby was caught up within the last 30 seconds?", which > would be a useful thing to monitor, and right now there's no way to do Well in my experience with replication, that's not what I want to monitor. If the standby is synchronous, then it's not catching up, it's streaming. If it were not, it would not be a synchronous standby. When a standby is asynchronous then what I want to monitor is its lag. So the CATCHUP state is useful to see that a synchronous standby candidate can not yet be a synchronous standby. When it just lost its synchronous status (and hopefully another standby is now the sync one), then it's just asynchronous and I want to know its lag. > it. There's also a BACKUP state, but I'm not sure it makes sense to > lump that in with the others. Some day it might be possible to stream > WAL and take a backup at the same time, over the same connection. > Maybe that should be a separate column or something. BACKUP is still meaningful if you stream WAL at the same time, because you're certainly *not* applying them while doing the base backup, are you? So you're not yet a standby, that's what BACKUP means. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Mar 10, 2011 at 2:42 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> was. So you could then say things like "is the most recent time at >> which the standby was caught up within the last 30 seconds?", which >> would be a useful thing to monitor, and right now there's no way to do > > Well in my experience with replication, that's not what I want to > monitor. If the standby is synchronous, then it's not catching up, it's > streaming. If it were not, it would not be a synchronous standby. > > When a standby is asynchronous then what I want to monitor is its lag. > > So the CATCHUP state is useful to see that a synchronous standby > candidate can not yet be a synchronous standby. When it just lost its > synchronous status (and hopefully another standby is now the sync one), > then it's just asynchronous and I want to know its lag. Yeah, maybe. The trick is how to measure the lag. I proposed the above scheme mostly as a way of giving the user some piece of information that can be measured in seconds rather than WAL position, but I'm open to better ideas. Monitoring is pretty hard to do at all in 9.0; in 9.1, we'll be able to tell them how many *bytes* behind they are, but there's no easy way to figure out what that means in terms of wall-clock time, which I think would be useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > they are, but there's no easy way to figure out what that means in > terms of wall-clock time, which I think would be useful. Jan Wieck had a detailed proposal to make that happen at last developper meeting, but then ran out of time to implement it for 9.1 it seems. The idea was basically to have a ticker in core, an SRF that would associate txid_snapshot with wall clock time. Lots of good things would come from that. http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php Of course if you think that's important enough for you to implement it between now and beta, that would be great :) Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> they are, but there's no easy way to figure out what that means in >> terms of wall-clock time, which I think would be useful. > > Jan Wieck had a detailed proposal to make that happen at last developper > meeting, but then ran out of time to implement it for 9.1 it seems. The > idea was basically to have a ticker in core, an SRF that would associate > txid_snapshot with wall clock time. Lots of good things would come from > that. > > http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php > > Of course if you think that's important enough for you to implement it > between now and beta, that would be great :) I think that's actually something a little different, and more complicated, but I do think it'd be useful. I was hoping there was a simple way to get some kind of time-based information into pg_stat_replication, but if there isn't, there isn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine > <dimitri@2ndquadrant.fr> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> they are, but there's no easy way to figure out what that means in >>> terms of wall-clock time, which I think would be useful. >> >> Jan Wieck had a detailed proposal to make that happen at last developper >> meeting, but then ran out of time to implement it for 9.1 it seems. The >> idea was basically to have a ticker in core, an SRF that would associate >> txid_snapshot with wall clock time. Lots of good things would come from >> that. >> >> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php >> >> Of course if you think that's important enough for you to implement it >> between now and beta, that would be great :) > > I think that's actually something a little different, and more > complicated, but I do think it'd be useful. I was hoping there was a > simple way to get some kind of time-based information into > pg_stat_replication, but if there isn't, there isn't. How about sending the timestamp of last applied transaction (i.e., this is the return value of pg_last_xact_replay_timestamp) from the standby to the master, and reporting it in pg_stat_replication? Then you can see the lag by comparing it with current_timestamp. But since the last replay timestamp doesn't advance (but current timestamp advances) if there is no work on the master, the calculated lag might be unexpectedly too large. So, to calculate the exact lag, I'm thinking that we should introduce new function which returns the timestamp of the last transaction written in the master. Thought? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
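With what 9.1 already provides, the standby-side half of this can be approximated today; the caveat above applies, i.e. on an idle master the value keeps growing even though the standby is fully caught up:

    -- run on the standby
    SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;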
On Fri, Mar 11, 2011 at 7:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine >> <dimitri@2ndquadrant.fr> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> they are, but there's no easy way to figure out what that means in >>>> terms of wall-clock time, which I think would be useful. >>> >>> Jan Wieck had a detailed proposal to make that happen at last developper >>> meeting, but then ran out of time to implement it for 9.1 it seems. The >>> idea was basically to have a ticker in core, an SRF that would associate >>> txid_snapshot with wall clock time. Lots of good things would come from >>> that. >>> >>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php >>> >>> Of course if you think that's important enough for you to implement it >>> between now and beta, that would be great :) >> >> I think that's actually something a little different, and more >> complicated, but I do think it'd be useful. I was hoping there was a >> simple way to get some kind of time-based information into >> pg_stat_replication, but if there isn't, there isn't. > > How about sending the timestamp of last applied transaction > (i.e., this is the return value of pg_last_xact_replay_timestamp) > from the standby to the master, and reporting it in > pg_stat_replication? Then you can see the lag by comparing > it with current_timestamp. > > But since the last replay timestamp doesn't advance (but > current timestamp advances) if there is no work on the master, > the calculated lag might be unexpectedly too large. So, to > calculate the exact lag, I'm thinking that we should introduce > new function which returns the timestamp of the last transaction > written in the master. > > Thought? Hmm... where would we get that value from? And what if no transactions are running on the master? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> How about sending the timestamp of last applied transaction >> (i.e., this is the return value of pg_last_xact_replay_timestamp) >> from the standby to the master, and reporting it in >> pg_stat_replication? Then you can see the lag by comparing >> it with current_timestamp. >> >> But since the last replay timestamp doesn't advance (but >> current timestamp advances) if there is no work on the master, >> the calculated lag might be unexpectedly too large. So, to >> calculate the exact lag, I'm thinking that we should introduce >> new function which returns the timestamp of the last transaction >> written in the master. >> >> Thought? > > Hmm... where would we get that value from? xl_xact_commit->xact_time (which is set in RecordTransactionCommit) and xl_xact_abort->xact_time (which is set in RecordTransactionAbort). > And what if no > transactions are running on the master? In that case, the last write WAL timestamp would become equal to the last replay WAL timestamp. So we can see that there is no lag. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
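As a hypothetical sketch of how the proposal might look from the master's side (neither pg_last_xact_write_timestamp() nor a replay-timestamp column in pg_stat_replication exists in 9.1; both are stand-ins for the pieces being proposed, the write-side value coming from xact_time in the commit/abort records and the replay-side value shipped back from the standby):

    SELECT application_name,
           pg_last_xact_write_timestamp() - standby_replay_timestamp AS replay_lag
    FROM pg_stat_replication;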
On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> How about sending the timestamp of last applied transaction >>> (i.e., this is the return value of pg_last_xact_replay_timestamp) >>> from the standby to the master, and reporting it in >>> pg_stat_replication? Then you can see the lag by comparing >>> it with current_timestamp. >>> >>> But since the last replay timestamp doesn't advance (but >>> current timestamp advances) if there is no work on the master, >>> the calculated lag might be unexpectedly too large. So, to >>> calculate the exact lag, I'm thinking that we should introduce >>> new function which returns the timestamp of the last transaction >>> written in the master. >>> >>> Thought? >> >> Hmm... where would we get that value from? > > xl_xact_commit->xact_time (which is set in RecordTransactionCommit) > and xl_xact_abort->xact_time (which is set in RecordTransactionAbort). > >> And what if no >> transactions are running on the master? > > In that case, the last write WAL timestamp would become equal to the > last replay WAL timestamp. So we can see that there is no lag. Oh, I see (I think). You're talking about write/replay lag, but I was thinking of master/slave transmission lag. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 09:03:33AM -0500, Robert Haas wrote: > On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > > > In that case, the last write WAL timestamp would become equal to the > > last replay WAL timestamp. So we can see that there is no lag. > > Oh, I see (I think). You're talking about write/replay lag, but I was > thinking of master/slave transmission lag. > Which are both useful numbers to know: the first tells you how "stale" queries from a Hot Standby will be; the second tells you the maximum data loss from a "meteor hits the master" scenario where that slave is promoted, if I understand all the interactions correctly. Ross -- Ross Reedstrom, Ph.D. reedstrm@rice.edu Systems Engineer & Admin, Research Scientist phone: 713-348-6166 Connexions http://cnx.org fax: 713-348-3665 Rice University MS-375, Houston, TX 77005 GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
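The standby-side halves of those two numbers can be read with functions that already exist in 9.1; the transmission lag back to the master is only visible from the master's pg_stat_replication:

    -- run on the standby
    SELECT now() - pg_last_xact_replay_timestamp() AS query_staleness,
           pg_last_xlog_receive_location()         AS wal_received_up_to,
           pg_last_xlog_replay_location()          AS wal_replayed_up_to;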
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > * Smart shutdown > Smart shutdown should wait for all the waiting backends to be acked, and > should not cause them to forcibly exit. But this leads shutdown to get stuck > infinitely if there is no walsender at that time. To enable them to be acked > even in that situation, we need to change postmaster so that it accepts the > replication connection even during smart shutdown (until we reach > PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser > connection to cancel backup during smart shutdown. So I don't think that > the idea to accept the replication connection during smart shutdown is so > ugly. > > * Fast shutdown > I agree with you about fast shutdown. Fast shutdown should cause all the > backends including waiting ones to exit immediately. At that time, the > non-acked backend should not return the success, according to the > definition of sync rep. So we need to change a backend so that it gets rid > of itself from the waiting queue and exits before returning the success, > when it receives SIGTERM. This change leads the waiting backends to > do the same even when pg_terminate_backend is called. But since > they've not been acked yet, it seems to be reasonable to prevent them > from returning the COMMIT. > > Comments? I'll create the patch barring objection. The fast smart shutdown part of this problem has been addressed. The smart shutdown case still needs work, and I think the consensus was that your proposal above was the best way to go with it. Do you still want to work up a patch for this? If so, I can review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
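Note that pg_terminate_backend() also delivers SIGTERM, so under the behaviour described above it would likewise make a backend waiting for sync rep leave the queue and exit without confirming the COMMIT to its client. A sketch (the pid is illustrative only):

    SELECT pg_terminate_backend(12345);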
On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> * Smart shutdown >> Smart shutdown should wait for all the waiting backends to be acked, and >> should not cause them to forcibly exit. But this leads shutdown to get stuck >> infinitely if there is no walsender at that time. To enable them to be acked >> even in that situation, we need to change postmaster so that it accepts the >> replication connection even during smart shutdown (until we reach >> PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser >> connection to cancel backup during smart shutdown. So I don't think that >> the idea to accept the replication connection during smart shutdown is so >> ugly. >> >> * Fast shutdown >> I agree with you about fast shutdown. Fast shutdown should cause all the >> backends including waiting ones to exit immediately. At that time, the >> non-acked backend should not return the success, according to the >> definition of sync rep. So we need to change a backend so that it gets rid >> of itself from the waiting queue and exits before returning the success, >> when it receives SIGTERM. This change leads the waiting backends to >> do the same even when pg_terminate_backend is called. But since >> they've not been acked yet, it seems to be reasonable to prevent them >> from returning the COMMIT. >> >> Comments? I'll create the patch barring objection. > > The fast smart shutdown part of this problem has been addressed. The Ugh. I mean "the fast shutdown", of course, not "the fast smart shutdown". Anyway, point is: fast shutdown now OK smart shutdown still not OK do you want to write a patch? :-) > smart shutdown case still needs work, and I think the consensus was > that your proposal above was the best way to go with it. > > Do you still want to work up a patch for this? If so, I can review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Mar 19, 2011 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> * Smart shutdown >>> Smart shutdown should wait for all the waiting backends to be acked, and >>> should not cause them to forcibly exit. But this leads shutdown to get stuck >>> infinitely if there is no walsender at that time. To enable them to be acked >>> even in that situation, we need to change postmaster so that it accepts the >>> replication connection even during smart shutdown (until we reach >>> PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser >>> connection to cancel backup during smart shutdown. So I don't think that >>> the idea to accept the replication connection during smart shutdown is so >>> ugly. >>> >>> * Fast shutdown >>> I agree with you about fast shutdown. Fast shutdown should cause all the >>> backends including waiting ones to exit immediately. At that time, the >>> non-acked backend should not return the success, according to the >>> definition of sync rep. So we need to change a backend so that it gets rid >>> of itself from the waiting queue and exits before returning the success, >>> when it receives SIGTERM. This change leads the waiting backends to >>> do the same even when pg_terminate_backend is called. But since >>> they've not been acked yet, it seems to be reasonable to prevent them >>> from returning the COMMIT. >>> >>> Comments? I'll create the patch barring objection. >> >> The fast smart shutdown part of this problem has been addressed. The > > Ugh. I mean "the fast shutdown", of course, not "the fast smart > shutdown". Anyway, point is: > > fast shutdown now OK > smart shutdown still not OK > do you want to write a patch? > > :-) > >> smart shutdown case still needs work, and I think the consensus was >> that your proposal above was the best way to go with it. >> >> Do you still want to work up a patch for this? If so, I can review. Sure. Will do. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Do you still want to work up a patch for this? If so, I can review. > > Sure. Will do. The attached patch allows standby servers to connect during smart shutdown in order to wake up backends waiting for sync rep. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> Do you still want to work up a patch for this? If so, I can review. >> >> Sure. Will do. > > The attached patch allows standby servers to connect during smart shutdown > in order to wake up backends waiting for sync rep. I think that is possibly OK, but the big problem is the lack of a clear set of comments about how the states are supposed to interact that allow these changes to be validated. That state isn't down to you, but I think we need that clear so this change is more obviously correct than it is now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>>> Do you still want to work up a patch for this? If so, I can review. >>> >>> Sure. Will do. >> >> The attached patch allows standby servers to connect during smart shutdown >> in order to wake up backends waiting for sync rep. > > I think that is possibly OK, but the big problem is the lack of a > clear set of comments about how the states are supposed to interact > that allow these changes to be validated. > > That state isn't down to you, but I think we need that clear so this > change is more obviously correct than it is now. src/backend/replication/README needs to be updated to make that clear? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Mar 24, 2011 at 11:53 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>>>> Do you still want to work up a patch for this? If so, I can review. >>>> >>>> Sure. Will do. >>> >>> The attached patch allows standby servers to connect during smart shutdown >>> in order to wake up backends waiting for sync rep. >> >> I think that is possibly OK, but the big problem is the lack of a >> clear set of comments about how the states are supposed to interact >> that allow these changes to be validated. >> >> That state isn't down to you, but I think we need that clear so this >> change is more obviously correct than it is now. > > src/backend/replication/README needs to be updated to make that clear? Not sure where we'd put them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote: > On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote: > > Simon Riggs wrote: > > > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: > > > > > > > postgres=# SELECT application_name, state, sync_priority, sync_state > > > > FROM pg_stat_replication; > > > > application_name | state | sync_priority | sync_state > > > > ------------------+-----------+---------------+------------ > > > > one | STREAMING | 1 | POTENTIAL > > > > two | streaming | 2 | sync > > > > (2 rows) > > > > > > Bug! Thanks. > > > > Is there a reason these status are all upper-case? > > NOT AS FAR AS I KNOW. > > I'll add it to the list of changes for beta. The attached patch lowercases the labels displayed in the view above. (The example above was originally all upper-case.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index af3c95a..470e6d1
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** WalSndGetStateString(WalSndState state)
*** 1350,1362 ****
      switch (state)
      {
          case WALSNDSTATE_STARTUP:
!             return "STARTUP";
          case WALSNDSTATE_BACKUP:
!             return "BACKUP";
          case WALSNDSTATE_CATCHUP:
!             return "CATCHUP";
          case WALSNDSTATE_STREAMING:
!             return "STREAMING";
      }
      return "UNKNOWN";
  }
--- 1350,1362 ----
      switch (state)
      {
          case WALSNDSTATE_STARTUP:
!             return "startup";
          case WALSNDSTATE_BACKUP:
!             return "backup";
          case WALSNDSTATE_CATCHUP:
!             return "catchup";
          case WALSNDSTATE_STREAMING:
!             return "streaming";
      }
      return "UNKNOWN";
  }
*************** pg_stat_get_wal_senders(PG_FUNCTION_ARGS
*** 1501,1511 ****
           * informational, not different from priority.
           */
          if (sync_priority[i] == 0)
!             values[7] = CStringGetTextDatum("ASYNC");
          else if (i == sync_standby)
!             values[7] = CStringGetTextDatum("SYNC");
          else
!             values[7] = CStringGetTextDatum("POTENTIAL");
      }

      tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- 1501,1511 ----
           * informational, not different from priority.
           */
          if (sync_priority[i] == 0)
!             values[7] = CStringGetTextDatum("async");
          else if (i == sync_standby)
!             values[7] = CStringGetTextDatum("sync");
          else
!             values[7] = CStringGetTextDatum("potential");
      }

      tuplestore_putvalues(tupstore, tupdesc, values, nulls);
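With the labels lowercased, monitoring queries should compare against 'sync', 'potential' and 'async' (and 'streaming', 'catchup', and so on for state); for example, to find the standby currently acting as the synchronous one:

    SELECT application_name, state, sync_priority, sync_state
    FROM pg_stat_replication
    WHERE sync_state = 'sync';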
Bruce Momjian wrote: > Simon Riggs wrote: > > On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote: > > > Simon Riggs wrote: > > > > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote: > > > > > > > > > postgres=# SELECT application_name, state, sync_priority, sync_state > > > > > FROM pg_stat_replication; > > > > > application_name | state | sync_priority | sync_state > > > > > ------------------+-----------+---------------+------------ > > > > > one | STREAMING | 1 | POTENTIAL > > > > > two | streaming | 2 | sync > > > > > (2 rows) > > > > > > > > Bug! Thanks. > > > > > > Is there a reason these status are all upper-case? > > > > NOT AS FAR AS I KNOW. > > > > I'll add it to the list of changes for beta. > > The attached patch lowercases the labels displayed in the view above. > (The example above was originally all upper-case.) Applied. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +