Thread: time-delayed standbys
While I was out at the MySQL conference last week, I heard that one of the forthcoming MySQL features is "time-delayed replication": http://forge.mysql.com/worklog/task.php?id=344 That is, a standby configured such that replay lags a prescribed amount of time behind the master. This seemed easy to implement, so I did. Patch (for 9.2, obviously) attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
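For context, the knob under discussion is a recovery.conf setting on the standby. A configuration along the following lines is the intended usage; the parameter name recovery_time_delay appears later in the thread, but the exact value syntax shown here is an assumption, not taken from the patch:

# recovery.conf on the delayed standby -- sketch only; value syntax is assumed
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'
restore_command = 'cp /wal_archive/%f %p'
recovery_time_delay = '5min'   # replay stays roughly 5 minutes behind the master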
On Wed, Apr 20, 2011 at 3:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > While I was out at the MySQL conference last week, I heard that one of > the forthcoming MySQL features is "time-delayed replication": Incidentally, this is a popular Oracle feature. It's a poor man's "flashback" and similar to how some filesystems automatically create regular snapshots of every home directory so you can get back stuff at some arbitrary point in the past. I haven't read the patch but are you delaying delivering the log or delaying replaying it? I think you actually want the latter so in case of a real failure you can choose between replaying the last 5 minutes and recovering everything or intentionally dropping that history if the failure was caused by an application problem. -- greg
Hello yesterday I was asked about this feature +1 Regards Pavel 2011/4/20 Robert Haas <robertmhaas@gmail.com>: > While I was out at the MySQL conference last week, I heard that one of > the forthcoming MySQL features is "time-delayed replication": > > http://forge.mysql.com/worklog/task.php?id=344 > > That is, a standby configured such that replay lags a prescribed > amount of time behind the master. > > This seemed easy to implement, so I did. Patch (for 9.2, obviously) attached. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Wed, Apr 20, 2011 at 12:24 PM, Greg Stark <gsstark@mit.edu> wrote: > I haven't read the patch but are you delaying delivering the log or > delaying replaying it? I think you actually want the latter so in case > of a real failure you can choose between replaying the last 5 minutes > and recovering everything or intentionally dropping that history if > the failure was caused by an application problem. > Ok, so I read the patch. That wasn't so hard. You did implement it the way I described, so great. However you did it by comparing the server timestamp and the slave timestamp. I think we should avoid that. Synchronizing clocks is a surprisingly hard problem and it would really be great if the whole system kept working if the clocks were out of sync. I don't think that's impossible, though it's a bit inconvenient. We would need to keep track of the most recent timestamp received from the master and pause recovery if we reach a record that's less than five minutes older than that latest timestamp. This means we're comparing only master timestamps to each other. It does mean that if the log transfer lags then recovery will lag unnecessarily. And it means the process of receiving logs is complicated by having to keep track of this timestamp. -- greg
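A minimal standalone sketch of the gating rule Greg describes, comparing only master-generated timestamps with each other; all names here (last_master_timestamp, ok_to_apply, delay_secs) are illustrative and not from the patch:

#include <stdbool.h>
#include <time.h>

/* Newest master timestamp seen in any WAL received so far (illustrative). */
static time_t last_master_timestamp;

/* Apply a record only once the master is known to have produced WAL at
 * least delay_secs newer than the record's own timestamp. */
static bool
ok_to_apply(time_t record_timestamp, int delay_secs)
{
    return last_master_timestamp >= record_timestamp + delay_secs;
}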
Greg Stark <gsstark@mit.edu> writes: > On Wed, Apr 20, 2011 at 3:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > >> While I was out at the MySQL conference last week, I heard that one of >> the forthcoming MySQL features is "time-delayed replication": > > Incidentally, this is a popular Oracle feature. It's a poor man's > "flashback" and similar to how some filesystems automatically create > regular snapshots of every home directory so you can get back stuff at > some arbitrary point in the past. Yup. One of the big bosses where I work went asking for this a couple of years ago. We're multi-platform: Oracle, MySQL, EnterpriseDB, Vertica. They put a 6-hour delay on the critical Oracle boxes. Nothing was done for MySQL or Vertica since there was no feature support. My C-foo being rusty, I elected to add more hacks to our home-spun Python version of pg_standby rather than adding the feature to pg_standby itself. Been running with delayed WAL apply ever since. Shipping happens immediately on log creation at the master and we simply wait for the files to age the configured time before application. In a few cases, we have 2 or more standbys off the same prod master. One in real-time mode and the others lagged. Thanks all! -- Jerry Sievers Postgres DBA/Development Consulting e: gsievers19@comcast.net p: 305.321.1144
On Wed, Apr 20, 2011 at 7:42 AM, Greg Stark <gsstark@mit.edu> wrote: > On Wed, Apr 20, 2011 at 12:24 PM, Greg Stark <gsstark@mit.edu> wrote: >> I haven't read the patch but are you delaying delivering the log or >> delaying replaying it? I think you actually want the latter so in case >> of a real failure you can choose between replaying the last 5 minutes >> and recovering everything or intentionally dropping that history if >> the failure was caused by an application problem. >> > > Ok, so I read the patch. That wasn't so hard. > > You did implement it the way I described so great. > > However you did it by comparing the server timestamp and the slave > timestamp. I think we should avoid that. Synchronizing clocks is a > surprisingly hard problem and it would really be great if the whole > system kept working if the clocks were out of sync. > > I don't think that's impossible, though it's a bit inconvenient. We > would need to keep track of the most recent timestamp received from > the master and pause recovery if we reach recovery on a record that's > less than five minutes older than that record. This means we're > comparing only master timestamps to each other. It does mean if the > log transfer lags then recovery will lag unnecessarily. And it means > the process of receiving logs is complicated by having to keep track > of this timestamp. How would we keep track of the most recent timestamp received from the master without replaying the WAL records? What happens if the connection to the master is lost? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: > How would we keep track of the most recent timestamp received from the > master without replaying the WAL records? Well as we receive them we would have to peek at them to see the time. Or we could have the master send its time to the slave as I believe Tom has already proposed in the past but I believe didn't choose to do? I haven't really been following all the twists and turns of streaming replication and sync replication. I read the e-mails but fell behind a few times so I'm not sure what the end result was sometimes. > What happens if the connection to the master is lost? We're writing these records to disk no? I don't understand the problem. It doesn't even matter if the records get lost and have to be retrieved again, as long as we gate the recovery of the records based on a time we know the master has seen sometime. Oh, I think I see what you're getting at. We would never replay the last five minutes if there's no connection. We definitely want to make sure the admin is aware of the lag and can disable this feature before bringing up the slave. -- greg
Greg Stark <gsstark@mit.edu> writes: > On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> How would we keep track of the most recent timestamp received from the >> master without replaying the WAL records? > Well as we receive them we would have to peek at them to see the time. > Or we could have the master send its time to the slave as I believe > Tom has already proposed in the past but I believe didn't choose to > do? My idea of how to manage it would be to have walreceiver explicitly track the clock difference from the master, which it can do since walsender puts its current time into every message header. You can use the slave's clock for comparisons so long as you add the appropriate offset. (The theory here is that the skew won't change very fast ...) regards, tom lane
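A standalone sketch of the bookkeeping Tom suggests: walreceiver remembers the difference between the master clock in each message header and the local clock, and comparisons then use the local clock plus that offset. All names here are made up for illustration:

#include <time.h>

/* master_time - local_time, in seconds; assumed to drift only slowly. */
static double master_clock_offset;

/* Called for every message received from the master. */
static void
note_master_timestamp(time_t master_send_time)
{
    master_clock_offset = difftime(master_send_time, time(NULL));
}

/* The slave's clock plus the remembered offset approximates master time. */
static time_t
estimated_master_time(void)
{
    return time(NULL) + (time_t) master_clock_offset;
}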
On Wed, Apr 20, 2011 at 3:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > My idea of how to manage it would be to have walreceiver explicitly > track the clock difference from the master, which it can do since > walsender puts its current time into every message header. You can use > the slave's clock for comparisons so long as you add the appropriate > offset. (The theory here is that the skew won't change very fast ...) > Ah, so we did put the master's clock in every message? Then this should be simple, no? Just compare the master's timestamp from the record to the last master's clock seen in the messages. That sounds equivalent but a lot safer than trying to keep a conversion between them. -- greg
Greg Stark <gsstark@mit.edu> writes: > Ah, so we did put the master's clock in every message? Yes, we did. > Then this > should be simple, no? Just compare the master's timestamp from the > record to the last master's clock seen in the messages. That sounds > equivalent but a lot safer than trying to keep a conversion between > them. Well, the question is what happens after you stop receiving master messages. If you don't make use of the slave's clock somehow, application of WAL will stop dead in the water, which seems unlikely to be what's wanted. regards, tom lane
On Wed, Apr 20, 2011 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> How would we keep track of the most recent timestamp received from the >>> master without replaying the WAL records? > >> Well as we receive them we would have to peek at them to see the time. >> Or we could have the master send its time to the slave as I believe >> Tom has already proposed in the past but I believe didn't choose to >> do? > > My idea of how to manage it would be to have walreceiver explicitly > track the clock difference from the master, which it can do since > walsender puts its current time into every message header. You can use > the slave's clock for comparisons so long as you add the appropriate > offset. (The theory here is that the skew won't change very fast ...) I am a bit concerned about the reliability of this approach. If there is some network lag, or some lag in processing from the master, we could easily get the idea that there is time skew between the machines when there really isn't. And our perception of the time skew could easily bounce around from message to message, as the lag varies. I think it would be tremendously ironic if the two machines were actually synchronized to the microsecond, but by trying to be clever about it we managed to make the lag-time accurate only to within several seconds. It looks like the MySQL version of this feature is spec'd to have some sort of time skew compensation built into it, but I don't see anything that explains how it actually works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I am a bit concerned about the reliability of this approach. If there > is some network lag, or some lag in processing from the master, we > could easily get the idea that there is time skew between the machines > when there really isn't. And our perception of the time skew could > easily bounce around from message to message, as the lag varies. I > think it would be tremendously ironic if the two machines were > actually synchronized to the microsecond, but by trying to be clever > about it we managed to make the lag-time accurate only to within > several seconds. Well, if walreceiver concludes that there is no more than a few seconds' difference between the clocks, it'd probably be OK to take the master timestamps at face value. The problem comes when the skew gets large (compared to the configured time delay, I guess). regards, tom lane
On Wed, Apr 20, 2011 at 3:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <gsstark@mit.edu> writes: >> Ah, so we did put the master's clock in every message? > > Yes, we did. And by "we" I mean "you".... I realize I'm tossing in comments from the peanut gallery to you and especially Robert who worked on this stuff a lot already. >> Then this >> should be simple, no? Just compare the master's timestamp from the >> record to the last master's clock seen in the messages. That sounds >> equivalent but a lot safer than trying to keep a conversion between >> them. > > Well, the question is what happens after you stop receiving master > messages. If you don't make use of the slave's clock somehow, > application of WAL will stop dead in the water, which seems unlikely > to be what's wanted. I'm not convinced that's so bad. But even so the logic could be: wait until (master.last_time_seen > this_record.master_timestamp + n minutes || gettimeofday() > this_record.local_time_when_received + n minutes) -- greg
On Wed, Apr 20, 2011 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I am a bit concerned about the reliability of this approach. If there >> is some network lag, or some lag in processing from the master, we >> could easily get the idea that there is time skew between the machines >> when there really isn't. And our perception of the time skew could >> easily bounce around from message to message, as the lag varies. I >> think it would be tremendously ironic if the two machines were >> actually synchronized to the microsecond, but by trying to be clever >> about it we managed to make the lag-time accurate only to within >> several seconds. > > Well, if walreceiver concludes that there is no more than a few seconds' > difference between the clocks, it'd probably be OK to take the master > timestamps at face value. The problem comes when the skew gets large > (compared to the configured time delay, I guess). I suppose. Any bound on how much lag there can be before we start applying the skew correction is going to be fairly arbitrary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > That is, a standby configured such that replay lags a prescribed > amount of time behind the master. > > This seemed easy to implement, so I did. Patch (for 9.2, obviously) attached. > This crashes when stopping recovery to a target (I tried with a named restore point and with a point in time) after executing pg_xlog_replay_resume(). Here is the backtrace. I will try to check later but I wanted to report it before... #0 0xb7777537 in raise () from /lib/libc.so.6 #1 0xb777a922 in abort () from /lib/libc.so.6 #2 0x08393a19 in errfinish (dummy=0) at elog.c:513 #3 0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver still active") at elog.c:1156 #4 0x080f04cb in StartupXLOG () at xlog.c:6691 #5 0x080f2825 in StartupProcessMain () at xlog.c:10050 #6 0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at bootstrap.c:417 #7 0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488 #8 0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106 #9 0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199 -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Sat, Apr 23, 2011 at 9:46 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote: > On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> That is, a standby configured such that replay lags a prescribed >> amount of time behind the master. >> >> This seemed easy to implement, so I did. Patch (for 9.2, obviously) attached. >> > > This crashes when stoping recovery to a target (i tried with a named > restore point and with a poin in time) after executing > pg_xlog_replay_resume(). here is the backtrace. I will try to check > later but i wanted to report it before... > > #0 0xb7777537 in raise () from /lib/libc.so.6 > #1 0xb777a922 in abort () from /lib/libc.so.6 > #2 0x08393a19 in errfinish (dummy=0) at elog.c:513 > #3 0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver > still active") at elog.c:1156 > #4 0x080f04cb in StartupXLOG () at xlog.c:6691 > #5 0x080f2825 in StartupProcessMain () at xlog.c:10050 > #6 0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at > bootstrap.c:417 > #7 0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488 > #8 0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106 > #9 0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199 Sorry for the slow response on this - I was on vacation for a week and my schedule got a big hole in it. I was able to reproduce something very like this in unpatched master, just by letting recovery pause at a named restore point, and then resuming it. LOG: recovery stopping at restore point "stop", time 2011-05-07 09:28:01.652958-04 LOG: recovery has paused HINT: Execute pg_xlog_replay_resume() to continue. (at this point I did pg_xlog_replay_resume()) LOG: redo done at 0/5000020 PANIC: wal receiver still active LOG: startup process (PID 38762) was terminated by signal 6: Abort trap LOG: terminating any other active server processes I'm thinking that this code is wrong: if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY) { SetRecoveryPause(true); recoveryPausesHere(); } reachedStopPoint = true; /* see below */ recoveryContinue = false; I think that recoveryContinue = false assignment should not happen if we decide to pause. That is, we should say if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY) { same as now } else recoveryContinue = false. I haven't tested that, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 7, 2011 at 10:48 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I was able to reproduce something very like this in unpatched master, > just by letting recovery pause at a named restore point, and then > resuming it. I was able to reproduce the same problem even in 9.0. When the standby reaches the recovery target, it always tries to end the recovery even though walreceiver is still running, which causes the problem. This seems to be an oversight in streaming replication. I should have considered how the standby should work when recovery_target is specified. What about the attached patch? Which stops walreceiver instead of emitting PANIC there only if we've reached the recovery target. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On 07.05.2011 16:48, Robert Haas wrote: > I was able to reproduce something very like this in unpatched master, > just by letting recovery pause at a named restore point, and then > resuming it. > > LOG: recovery stopping at restore point "stop", time 2011-05-07 > 09:28:01.652958-04 > LOG: recovery has paused > HINT: Execute pg_xlog_replay_resume() to continue. > (at this point I did pg_xlog_replay_resume()) > LOG: redo done at 0/5000020 > PANIC: wal receiver still active > LOG: startup process (PID 38762) was terminated by signal 6: Abort trap > LOG: terminating any other active server processes > > I'm thinking that this code is wrong: > > if (recoveryPauseAtTarget&& standbyState == > STANDBY_SNAPSHOT_READY) > { > SetRecoveryPause(true); > recoveryPausesHere(); > } > reachedStopPoint = true; /* see below */ > recoveryContinue = false; > > I think that recoveryContinue = false assignment should not happen if > we decide to pause. That is, we should say if (recoveryPauseAtTarget > && standbyState == STANDBY_SNAPSHOT_READY) { same as now } else > recoveryContinue = false. No, recovery stops at that point whether or not you pause. Resuming after stopping at the recovery target doesn't mean that you resume recovery, it means that you resume to end recovery and start up the server (see the 2nd to last paragraph at http://www.postgresql.org/docs/9.1/static/recovery-target-settings.html). It would probably be more useful to allow a new stopping target to be set and continue recovery, but the current pause/resume functions don't allow that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 11.05.2011 08:29, Fujii Masao wrote: > On Sat, May 7, 2011 at 10:48 PM, Robert Haas<robertmhaas@gmail.com> wrote: >> I was able to reproduce something very like this in unpatched master, >> just by letting recovery pause at a named restore point, and then >> resuming it. > > I was able to reproduce the same problem even in 9.0. When the standby > reaches the recovery target, it always tries to end the recovery even > though walreceiver is still running, which causes the problem. This seems > to be an oversight in streaming replication. I should have considered how > the standby should work when recovery_target is specified. > > What about the attached patch? Which stops walreceiver instead of > emitting PANIC there only if we've reached the recovery target. I think we can just always call ShutdownWalRcv(). It should be gone if the server was promoted while streaming, but that's just an implementation detail of what the promotion code does. There's no hard reason why it shouldn't be running at that point anymore, as long as we kill it before going any further. Committed a patch to do that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
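For reference, the committed change described here amounts to something like the following in StartupXLOG() — a paraphrase based on the backtrace earlier in the thread and this message, not the literal commit:

/* Before: recovery assumed walreceiver could not still be running here,
 * and PANICked when it was (the crash reported above). */
if (WalRcvInProgress())
    elog(PANIC, "wal receiver still active");

/* After: just make sure it is gone before ending recovery. */
ShutdownWalRcv();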
On Wed, May 11, 2011 at 6:50 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I think we can just always call ShutdownWalRcv(). It should be gone if the > server was promoted while streaming, but that's just an implementation > detail of what the promotion code does. There's no hard reason why it > shouldn't be running at that point anymore, as long as we kill it before > going any further. Okay. But I'd like to add the following assertion check just before ShutdownWalRcv() which you added, in order to detect such a bug that we found this time, i.e., the bug which causes unexpected end of recovery. Thought? Assert(reachedStopPoint || !WalRcvInProgress()) > Committed a patch to do that. Thanks. Should we backport it to 9.0? 9.0 has the same problem. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 11.05.2011 14:16, Fujii Masao wrote: > On Wed, May 11, 2011 at 6:50 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> I think we can just always call ShutdownWalRcv(). It should be gone if the >> server was promoted while streaming, but that's just an implementation >> detail of what the promotion code does. There's no hard reason why it >> shouldn't be running at that point anymore, as long as we kill it before >> going any further. > > Okay. But I'd like to add the following assertion check just before > ShutdownWalRcv() which you added, in order to detect such a bug > that we found this time, i.e., the bug which causes unexpected end > of recovery. Thought? > > Assert(reachedStopPoint || !WalRcvInProgress()) There's no unexpected end of recovery here. The recovery ends when we reach the target, as it should. It was the assumption that WAL receiver can't be running at that point anymore that was wrong. That assertion would work, AFAICS, but I don't think it's something we need to assert. There isn't any harm done if WAL receiver is still running, as long as we shut it down at that point. >> Committed a patch to do that. > > Thanks. Should we backport it to 9.0? 9.0 has the same problem. Ah, thanks, missed that, Cherry-picked to 9.0 now as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Apr 21, 2011 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 20, 2011 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I am a bit concerned about the reliability of this approach. If there >>> is some network lag, or some lag in processing from the master, we >>> could easily get the idea that there is time skew between the machines >>> when there really isn't. And our perception of the time skew could >>> easily bounce around from message to message, as the lag varies. I >>> think it would be tremendously ironic of the two machines were >>> actually synchronized to the microsecond, but by trying to be clever >>> about it we managed to make the lag-time accurate only to within >>> several seconds. >> >> Well, if walreceiver concludes that there is no more than a few seconds' >> difference between the clocks, it'd probably be OK to take the master >> timestamps at face value. The problem comes when the skew gets large >> (compared to the configured time delay, I guess). > > I suppose. Any bound on how much lag there can be before we start > applying to skew correction is going to be fairly arbitrary. When the replication connection is terminated, the standby tries to read WAL files from the archive. In this case, there is no walreceiver process, so how does the standby calculate the clock difference? > errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"), We should s/"a temporal"/"an Integer"? After we run "pg_ctl promote", time-delayed replication should be disabled? Otherwise, failover might take very long time when we set recovery_time_delay to high value. http://forge.mysql.com/worklog/task.php?id=344 According to the above page, one purpose of time-delayed replication is to protect against user mistakes on master. But, when an user notices his wrong operation on master, what should he do next? The WAL records of his wrong operation might have already arrived at the standby, so neither "promote" nor "restart" doesn't cancel that wrong operation. Instead, probably he should shutdown the standby, investigate the timestamp of XID of the operation he'd like to cancel, set recovery_target_time and restart the standby. Something like this procedures should be documented? Or, we should implement new "promote" mode which finishes a recovery as soon as "promote" is requested (i.e., not replay all the available WAL records)? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jun 15, 2011 at 12:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > http://forge.mysql.com/worklog/task.php?id=344 > According to the above page, one purpose of time-delayed replication is to > protect against user mistakes on master. But, when an user notices his wrong > operation on master, what should he do next? The WAL records of his wrong > operation might have already arrived at the standby, so neither "promote" nor > "restart" doesn't cancel that wrong operation. Instead, probably he should > shutdown the standby, investigate the timestamp of XID of the operation > he'd like to cancel, set recovery_target_time and restart the standby. > Something like this procedures should be documented? Or, we should > implement new "promote" mode which finishes a recovery as soon as > "promote" is requested (i.e., not replay all the available WAL records)? > I would prefer something like "pg_ctl promote -m immediate" that terminates the recovery -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación
On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > When the replication connection is terminated, the standby tries to read > WAL files from the archive. In this case, there is no walreceiver process, > so how does the standby calculate the clock difference? Good question. Also, just because we have streaming replication available doesn't mean that we should force people to use it. It's still perfectly legit to set up a standby that only use archive_command and restore_command, and it would be nice if this feature could still work in such an environment. I anticipate that most people want to use streaming replication, but a time-delayed standby is a good example of a case where you might decide you don't need it. It could be useful to have all the WAL present (but not yet applied) if you're thinking you might want to promote that standby - but my guess is that in many cases, the time-delayed standby will be *in addition* to one or more regular standbys that would be the primary promotion candidates. So I can see someone deciding that they'd rather not have the load of another walsender on the master, and just let the time-delayed standby read from the archive. Even if that were not an issue, I'm still more or less of the opinion that trying to solve the time synchronization problem is a rathole anyway. To really solve this problem well, you're going to need the standby to send a message containing a timestamp, get a reply back from the master that contains that timestamp and a master timestamp, and then compute based on those two timestamps plus the reply timestamp the maximum and minimum possible lag between the two machines. Then you're going to need to guess, based on several cycles of this activity, what the actual lag is, and adjust it over time (but not too quckly, unless of course a large manual step has occurred) as the clocks potentially drift apart from each other. This is basically what ntpd does, except that it can be virtually guaranteed that our implementation will suck by comparison. Time synchronization is neither easy nor our core competency, and I think trying to include it in this feature is going to result in a net loss of reliability. >> errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"), > > We should s/"a temporal"/"an Integer"? It seems strange to ask for an integer when what we want is an amount of time in seconds or minutes... > After we run "pg_ctl promote", time-delayed replication should be disabled? > Otherwise, failover might take very long time when we set recovery_time_delay > to high value. Yeah, I think so. > http://forge.mysql.com/worklog/task.php?id=344 > According to the above page, one purpose of time-delayed replication is to > protect against user mistakes on master. But, when an user notices his wrong > operation on master, what should he do next? The WAL records of his wrong > operation might have already arrived at the standby, so neither "promote" nor > "restart" doesn't cancel that wrong operation. Instead, probably he should > shutdown the standby, investigate the timestamp of XID of the operation > he'd like to cancel, set recovery_target_time and restart the standby. > Something like this procedures should be documented? Or, we should > implement new "promote" mode which finishes a recovery as soon as > "promote" is requested (i.e., not replay all the available WAL records)? I like the idea of a new promote mode; and documenting the other approach you mention doesn't sound bad either. 
Either one sounds like a job for a separate patch, though. The other option is to pause recovery and run pg_dump... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
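The round-trip bound Robert sketches above is the same calculation NTP starts from. In standalone form, with illustrative names and "offset" meaning master clock minus standby clock:

/* t1 = standby clock when its message was sent
 * t2 = master clock stamped into the reply
 * t3 = standby clock when the reply arrived
 * The master read t2 at some standby time within [t1, t3], so the true
 * clock offset must lie inside the window computed here. */
static void
offset_bounds(double t1, double t2, double t3,
              double *min_offset, double *max_offset)
{
    *min_offset = t2 - t3;
    *max_offset = t2 - t1;
}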
On Fri, Jun 17, 2011 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Even if that were not an issue, I'm still more or less of the opinion > that trying to solve the time synchronization problem is a rathole > anyway. To really solve this problem well, you're going to need the > standby to send a message containing a timestamp, get a reply back > from the master that contains that timestamp and a master timestamp, > and then compute based on those two timestamps plus the reply > timestamp the maximum and minimum possible lag between the two > machines. Then you're going to need to guess, based on several cycles > of this activity, what the actual lag is, and adjust it over time (but > not too quckly, unless of course a large manual step has occurred) as > the clocks potentially drift apart from each other. This is basically > what ntpd does, except that it can be virtually guaranteed that our > implementation will suck by comparison. Time synchronization is > neither easy nor our core competency, and I think trying to include it > in this feature is going to result in a net loss of reliability. Agreed. You've already added the note about time synchronization into the document. That's enough, I think. >>> errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"), >> >> We should s/"a temporal"/"an Integer"? > > It seems strange to ask for an integer when what we want is an amount > of time in seconds or minutes... OK. >> http://forge.mysql.com/worklog/task.php?id=344 >> According to the above page, one purpose of time-delayed replication is to >> protect against user mistakes on master. But, when an user notices his wrong >> operation on master, what should he do next? The WAL records of his wrong >> operation might have already arrived at the standby, so neither "promote" nor >> "restart" doesn't cancel that wrong operation. Instead, probably he should >> shutdown the standby, investigate the timestamp of XID of the operation >> he'd like to cancel, set recovery_target_time and restart the standby. >> Something like this procedures should be documented? Or, we should >> implement new "promote" mode which finishes a recovery as soon as >> "promote" is requested (i.e., not replay all the available WAL records)? > > I like the idea of a new promote mode; Are you going to implement that mode in this CF? or next one? > and documenting the other > approach you mention doesn't sound bad either. Either one sounds like > a job for a separate patch, though. > > The other option is to pause recovery and run pg_dump... Yes, please. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jun 16, 2011 at 10:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> According to the above page, one purpose of time-delayed replication is to >>> protect against user mistakes on master. But, when an user notices his wrong >>> operation on master, what should he do next? The WAL records of his wrong >>> operation might have already arrived at the standby, so neither "promote" nor >>> "restart" doesn't cancel that wrong operation. Instead, probably he should >>> shutdown the standby, investigate the timestamp of XID of the operation >>> he'd like to cancel, set recovery_target_time and restart the standby. >>> Something like this procedures should be documented? Or, we should >>> implement new "promote" mode which finishes a recovery as soon as >>> "promote" is requested (i.e., not replay all the available WAL records)? >> >> I like the idea of a new promote mode; > > Are you going to implement that mode in this CF? or next one? I wasn't really planning on it - I thought you might want to take a crack at it. The feature is usable without that, just maybe a bit less cool. Certainly, it's too late for any more formal submissions to this CF, but I wouldn't mind reviewing a patch if you want to write one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 17, 2011 at 11:34 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 16, 2011 at 10:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> According to the above page, one purpose of time-delayed replication is to >>>> protect against user mistakes on master. But, when an user notices his wrong >>>> operation on master, what should he do next? The WAL records of his wrong >>>> operation might have already arrived at the standby, so neither "promote" nor >>>> "restart" doesn't cancel that wrong operation. Instead, probably he should >>>> shutdown the standby, investigate the timestamp of XID of the operation >>>> he'd like to cancel, set recovery_target_time and restart the standby. >>>> Something like this procedures should be documented? Or, we should >>>> implement new "promote" mode which finishes a recovery as soon as >>>> "promote" is requested (i.e., not replay all the available WAL records)? >>> >>> I like the idea of a new promote mode; >> >> Are you going to implement that mode in this CF? or next one? > > I wasn't really planning on it - I thought you might want to take a > crack at it. The feature is usable without that, just maybe a bit > less cool. Right. > Certainly, it's too late for any more formal submissions > to this CF, but I wouldn't mind reviewing a patch if you want to write > one. Okay, I add that into my TODO list. But I might not have enough time to develop that. So, everyone, please feel free to implement that if you want! Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > After we run "pg_ctl promote", time-delayed replication should be disabled? > Otherwise, failover might take very long time when we set recovery_time_delay > to high value. PFA a patch that I believe will disable recovery_time_delay after promotion. The only change from the previous version is: diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog index 1dbf792..41b3ae9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5869,7 +5869,7 @@ pg_is_xlog_replay_paused(PG_FUNCTION_ARGS) static void recoveryDelay(void) { - while (1) + while (!CheckForStandbyTrigger()) { long secs; int microsecs; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
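For readers without the attachment, the loop the diff above touches presumably looks roughly like this. Only the while condition and the secs/microsecs declarations are taken from the posted diff; the rest is a reconstruction using standard backend helpers (TimestampDifference, GetCurrentTimestamp, pg_usleep) and the recoveryDelayUntilTime variable mentioned later in the thread:

static void
recoveryDelay(void)
{
    while (!CheckForStandbyTrigger())   /* stop delaying once promotion is requested */
    {
        long    secs;
        int     microsecs;

        /* How far away is the record's wake-up time? */
        TimestampDifference(GetCurrentTimestamp(),
                            recoveryDelayUntilTime,
                            &secs, &microsecs);
        if (secs <= 0 && microsecs <= 0)
            break;              /* delay has elapsed; go apply the record */

        pg_usleep(1000000L);    /* sleep a second and re-check */
    }
}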
On Wed, Jun 15, 2011 at 6:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Or, we should > implement new "promote" mode which finishes a recovery as soon as > "promote" is requested (i.e., not replay all the available WAL records)? That's not a new feature. We had it in 8.4, but it was removed. Originally, we supported "fast" failover via trigger file. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> When the replication connection is terminated, the standby tries to read >> WAL files from the archive. In this case, there is no walreceiver process, >> so how does the standby calculate the clock difference? > > Good question. Also, just because we have streaming replication > available doesn't mean that we should force people to use it. It's > still perfectly legit to set up a standby that only use > archive_command and restore_command, and it would be nice if this > feature could still work in such an environment. I anticipate that > most people want to use streaming replication, but a time-delayed > standby is a good example of a case where you might decide you don't > need it. It could be useful to have all the WAL present (but not yet > applied) if you're thinking you might want to promote that standby - > but my guess is that in many cases, the time-delayed standby will be > *in addition* to one or more regular standbys that would be the > primary promotion candidates. So I can see someone deciding that > they'd rather not have the load of another walsender on the master, > and just let the time-delayed standby read from the archive. > > Even if that were not an issue, I'm still more or less of the opinion > that trying to solve the time synchronization problem is a rathole > anyway. To really solve this problem well, you're going to need the > standby to send a message containing a timestamp, get a reply back > from the master that contains that timestamp and a master timestamp, > and then compute based on those two timestamps plus the reply > timestamp the maximum and minimum possible lag between the two > machines. Then you're going to need to guess, based on several cycles > of this activity, what the actual lag is, and adjust it over time (but > not too quckly, unless of course a large manual step has occurred) as > the clocks potentially drift apart from each other. This is basically > what ntpd does, except that it can be virtually guaranteed that our > implementation will suck by comparison. Time synchronization is > neither easy nor our core competency, and I think trying to include it > in this feature is going to result in a net loss of reliability. This begs the question of why we need this feature at all, in the way proposed. Streaming replication is designed for immediate transfer of WAL. File based is more about storing them for some later use. It seems strange to pollute the *immediate* transfer route with a delay, when that is easily possible with a small patch to pg_standby that can wait until the filetime delay is > X before returning. The main practical problem with this is that most people's WAL partitions aren't big enough to store the delayed WAL files, which is why we provide the file archiving route anyway. So in practical terms this will be unusable, or at least dangerous to use. +1 for the feature concept, but -1 for adding this to streaming replication. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
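The pg_standby-style alternative Simon describes would gate on the age of the archived file itself. A standalone sketch of that check (this is not an existing pg_standby option):

#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Do not hand a WAL segment to the server until the archived file is at
 * least delay_secs old according to its modification time. */
static bool
segment_old_enough(const char *path, int delay_secs)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;           /* not archived yet; keep waiting */

    return difftime(time(NULL), st.st_mtime) >= delay_secs;
}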
On Wed, Jun 29, 2011 at 4:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> When the replication connection is terminated, the standby tries to read >>> WAL files from the archive. In this case, there is no walreceiver process, >>> so how does the standby calculate the clock difference? >> >> Good question. Also, just because we have streaming replication >> available doesn't mean that we should force people to use it. It's >> still perfectly legit to set up a standby that only use >> archive_command and restore_command, and it would be nice if this >> feature could still work in such an environment. I anticipate that >> most people want to use streaming replication, but a time-delayed >> standby is a good example of a case where you might decide you don't >> need it. It could be useful to have all the WAL present (but not yet >> applied) if you're thinking you might want to promote that standby - >> but my guess is that in many cases, the time-delayed standby will be >> *in addition* to one or more regular standbys that would be the >> primary promotion candidates. So I can see someone deciding that >> they'd rather not have the load of another walsender on the master, >> and just let the time-delayed standby read from the archive. >> >> Even if that were not an issue, I'm still more or less of the opinion >> that trying to solve the time synchronization problem is a rathole >> anyway. To really solve this problem well, you're going to need the >> standby to send a message containing a timestamp, get a reply back >> from the master that contains that timestamp and a master timestamp, >> and then compute based on those two timestamps plus the reply >> timestamp the maximum and minimum possible lag between the two >> machines. Then you're going to need to guess, based on several cycles >> of this activity, what the actual lag is, and adjust it over time (but >> not too quckly, unless of course a large manual step has occurred) as >> the clocks potentially drift apart from each other. This is basically >> what ntpd does, except that it can be virtually guaranteed that our >> implementation will suck by comparison. Time synchronization is >> neither easy nor our core competency, and I think trying to include it >> in this feature is going to result in a net loss of reliability. > > > This begs the question of why we need this feature at all, in the way proposed. > > Streaming replication is designed for immediate transfer of WAL. File > based is more about storing them for some later use. > > It seems strange to pollute the *immediate* transfer route with a > delay, when that is easily possible with a small patch to pg_standby > that can wait until the filetime delay is > X before returning. > > The main practical problem with this is that most people's WAL > partitions aren't big enough to store the delayed WAL files, which is > why we provide the file archiving route anyway. So in practical terms > this will be unusable, or at least dangerous to use. > > +1 for the feature concept, but -1 for adding this to streaming replication. As implemented, the feature will work with either streaming replication or with file-based replication. I don't see any value in restricting to work ONLY with file-based replication. 
Also, if we were to do it by making pg_standby wait, then the whole thing would be much less accurate, and the delay would become much harder to predict, because you'd be operating on the level of entire WAL segments, rather than individual commit records. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 29, 2011 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 29, 2011 at 4:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> When the replication connection is terminated, the standby tries to read >>>> WAL files from the archive. In this case, there is no walreceiver process, >>>> so how does the standby calculate the clock difference? >>> >>> Good question. Also, just because we have streaming replication >>> available doesn't mean that we should force people to use it. It's >>> still perfectly legit to set up a standby that only use >>> archive_command and restore_command, and it would be nice if this >>> feature could still work in such an environment. I anticipate that >>> most people want to use streaming replication, but a time-delayed >>> standby is a good example of a case where you might decide you don't >>> need it. It could be useful to have all the WAL present (but not yet >>> applied) if you're thinking you might want to promote that standby - >>> but my guess is that in many cases, the time-delayed standby will be >>> *in addition* to one or more regular standbys that would be the >>> primary promotion candidates. So I can see someone deciding that >>> they'd rather not have the load of another walsender on the master, >>> and just let the time-delayed standby read from the archive. >>> >>> Even if that were not an issue, I'm still more or less of the opinion >>> that trying to solve the time synchronization problem is a rathole >>> anyway. To really solve this problem well, you're going to need the >>> standby to send a message containing a timestamp, get a reply back >>> from the master that contains that timestamp and a master timestamp, >>> and then compute based on those two timestamps plus the reply >>> timestamp the maximum and minimum possible lag between the two >>> machines. Then you're going to need to guess, based on several cycles >>> of this activity, what the actual lag is, and adjust it over time (but >>> not too quckly, unless of course a large manual step has occurred) as >>> the clocks potentially drift apart from each other. This is basically >>> what ntpd does, except that it can be virtually guaranteed that our >>> implementation will suck by comparison. Time synchronization is >>> neither easy nor our core competency, and I think trying to include it >>> in this feature is going to result in a net loss of reliability. >> >> >> This begs the question of why we need this feature at all, in the way proposed. >> >> Streaming replication is designed for immediate transfer of WAL. File >> based is more about storing them for some later use. >> >> It seems strange to pollute the *immediate* transfer route with a >> delay, when that is easily possible with a small patch to pg_standby >> that can wait until the filetime delay is > X before returning. >> >> The main practical problem with this is that most people's WAL >> partitions aren't big enough to store the delayed WAL files, which is >> why we provide the file archiving route anyway. So in practical terms >> this will be unusable, or at least dangerous to use. >> >> +1 for the feature concept, but -1 for adding this to streaming replication. > > As implemented, the feature will work with either streaming > replication or with file-based replication. 
That sounds like the exact opposite of yours and Fujii's comments above. Please explain. > I don't see any value in > restricting to work ONLY with file-based replication. As explained above, it won't work in practice because of the amount of file space required. Or, an alternative question: what will you do when it waits so long that the standby runs out of disk space? If you hard-enforce the time delay specified then you just make replication fail under heavy load. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 29, 2011 at 1:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> As implemented, the feature will work with either streaming >> replication or with file-based replication. > > That sounds like the exact opposite of yours and Fujii's comments > above. Please explain. I think our comments above were addressing the issue of whether it's feasible to correct for time skew between the master and the slave. Tom was arguing that we should try, but I was arguing that any system we put together is likely to be pretty unreliable (since good time synchronization algorithms are quite complex, and to my knowledge no one here is an expert on implementing them, nor do I think we want that much complexity in the backend) and Fujii was pointing out that it won't work at all if the WAL files are going through the archive rather than through streaming replication, which (if I understand you correctly) will be a more common case than I had assumed. >> I don't see any value in >> restricting to work ONLY with file-based replication. > > As explained above, it won't work in practice because of the amount of > file space required. I guess it depends on how busy your system is and how much disk space you have. If using streaming replication causes pg_xlog to fill up on your standby, then you can either (1) put pg_xlog on a larger file system or (2) configure only restore_command and not primary_conninfo, so that only the archive is used. > Or, an alternative question: what will you do when it waits so long > that the standby runs out of disk space? I don't really see how that's any different from what happens now. If (for whatever reason) the master is generating WAL faster than a streaming standby can replay it, then the excess WAL is going to pile up someplace, and you might run out of disk space. Time-delaying the standby creates an additional way for that to happen, but I don't think it's an entirely new problem. I am not sure exactly how walreceiver handles it if the disk is full. I assume it craps out and eventually retries, so probably what will happen is that, after the standby's pg_xlog directory fills up, walreceiver will sit there and error out until replay advances enough to remove a WAL file and thus permit some more data to be streamed. If the standby gets far enough behind the master that the required files are no longer there, then it will switch to the archive, if available. It might be nice to have a mode that only allows streaming replication when the amount of disk space on the standby is greater than or equal to some threshold, but that seems like a topic for another patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, > I don't really see how that's any different from what happens now. If > (for whatever reason) the master is generating WAL faster than a > streaming standby can replay it, then the excess WAL is going to pile > up someplace, and you might run out of disk space. Time-delaying the > standby creates an additional way for that to happen, but I don't > think it's an entirely new problem. Not remotely new. xlog partition full is currently 75% of the emergency support calls PGX gets from clients on 9.0 (if only they'd pay attention to their nagios alerts!) > I am not sure exactly how walreceiver handles it if the disk is full. > I assume it craps out and eventually retries, so probably what will > happen is that, after the standby's pg_xlog directory fills up, > walreceiver will sit there and error out until replay advances enough > to remove a WAL file and thus permit some more data to be streamed. Nope, it gets stuck and stops there. Replay doesn't advance unless you can somehow clear out some space manually; if the disk is full, the disk is full, and PostgreSQL doesn't remove WAL files without being able to write files first. Manual (or scripted) intervention is always necessary if you reach disk 100% full. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 6/29/11 11:11 AM, Robert Haas wrote: > If the standby gets far enough behind the master that the required > files are no longer there, then it will switch to the archive, if > available. One more thing: As I understand it (and my testing shows this), the standby *prefers* the archive logs, and won't switch to streaming until it reaches the end of the archive logs. This is desirable behavior, as it minimizes the load on the master. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Jun 29, 2011 at 9:54 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I am not sure exactly how walreceiver handles it if the disk is full. >> I assume it craps out and eventually retries, so probably what will >> happen is that, after the standby's pg_xlog directory fills up, >> walreceiver will sit there and error out until replay advances enough >> to remove a WAL file and thus permit some more data to be streamed. > > Nope, it gets stuck and stops there. Replay doesn't advance unless you > can somehow clear out some space manually; if the disk is full, the disk > is full, and PostgreSQL doesn't remove WAL files without being able to > write files first. > > Manual (or scripted) intervention is always necessary if you reach disk > 100% full. Wow, that's a pretty crappy failure mode... but I don't think we need to fix it just on account of this patch. It would be nice to fix, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 29, 2011 at 11:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> After we run "pg_ctl promote", time-delayed replication should be disabled? >> Otherwise, failover might take very long time when we set recovery_time_delay >> to high value. > > PFA a patch that I believe will disable recovery_time_delay after > promotion. The only change from the previous version is: > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog > index 1dbf792..41b3ae9 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -5869,7 +5869,7 @@ pg_is_xlog_replay_paused(PG_FUNCTION_ARGS) > static void > recoveryDelay(void) > { > - while (1) > + while (!CheckForStandbyTrigger()) > { > long secs; > int microsecs; Thanks for updating patch! I have a few comments; ISTM recoveryDelayUntilTime needs to be calculated also when replaying the commit *compact* WAL record (i.e., record_info == XLOG_XACT_COMMIT_COMPACT). When the user uses only two-phase commit on the master, ISTM he or she cannot use this feature. Because recoveryDelayUntilTime is never set in that case. Is this intentional? We should disable this feature also after recovery reaches the stop point (specified in recovery_target_xxx)? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
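A sketch of the shape Fujii's first comment suggests — recoveryDelayUntilTime, recovery_time_delay, and the XLOG_XACT_* values come from the thread and the backend headers, while the helper itself, its caller, and the assumption that recovery_time_delay is stored in milliseconds are illustrative guesses, not the actual patch:

/* Called from the redo loop for each xact record; handling both flavors of
 * commit record is the point of the comment above. */
static void
MaybeSetRecoveryDelayTarget(uint8 record_info, TimestampTz xact_time)
{
    if (record_info == XLOG_XACT_COMMIT ||
        record_info == XLOG_XACT_COMMIT_COMPACT)
        recoveryDelayUntilTime =
            TimestampTzPlusMilliseconds(xact_time, recovery_time_delay);
}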
On Thu, Jun 30, 2011 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Nope, it gets stuck and stops there. Replay doesn't advance unless you >> can somehow clear out some space manually; if the disk is full, the disk >> is full, and PostgreSQL doesn't remove WAL files without being able to >> write files first. >> >> Manual (or scripted) intervention is always necessary if you reach disk >> 100% full. > > Wow, that's a pretty crappy failure mode... but I don't think we need > to fix it just on account of this patch. It would be nice to fix, of > course. Yeah, we need to fix that as a separate patch. The difficult point is that we cannot delete WAL files until we replay the checkpoint record and a restartpoint occurs. But, if the disk is full, there would be no space to receive the checkpoint record, so we cannot keep receiving WAL files indefinitely. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jun 30, 2011 at 12:14 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > Should we also disable this feature after recovery reaches the stop > point (specified in recovery_target_xxx)? Another comment: it would be very helpful to document the behavior of a delayed standby when promoting or after reaching the stop point. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jun 30, 2011 at 2:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 29, 2011 at 9:54 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> I am not sure exactly how walreceiver handles it if the disk is full. >>> I assume it craps out and eventually retries, so probably what will >>> happen is that, after the standby's pg_xlog directory fills up, >>> walreceiver will sit there and error out until replay advances enough >>> to remove a WAL file and thus permit some more data to be streamed. >> >> Nope, it gets stuck and stops there. Replay doesn't advance unless you >> can somehow clear out some space manually; if the disk is full, the disk >> is full, and PostgreSQL doesn't remove WAL files without being able to >> write files first. >> >> Manual (or scripted) intervention is always necessary if you reach disk >> 100% full. > > Wow, that's a pretty crappy failure mode... but I don't think we need > to fix it just on account of this patch. It would be nice to fix, of > course. How is that different to running out of space in the main database? If I try to pour a pint of milk into a small cup, I don't blame the cup. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 29, 2011 at 7:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I don't really see how that's any different from what happens now. If > (for whatever reason) the master is generating WAL faster than a > streaming standby can replay it, then the excess WAL is going to pile > up someplace, and you might run out of disk space. Time-delaying the > standby creates an additional way for that to happen, but I don't > think it's an entirely new problem. The only way to control this is with a time delay that can be changed while the server is running. A recovery.conf parameter doesn't allow that, so another way is preferable. I think the time problems are more complex than said. The patch relies upon transaction completion times, but not all WAL records have a time attached to them. Plus you only used commits anyway, not sure why. Some actions aren't even transactional, such as DROP DATABASE, amongst others. Consecutive records can be hours apart, so it would be possible to delay on some WAL records but then replay records that happened minutes ago, then wait hours for the next apply. So this patch doesn't do what it claims in all cases. Similar discussion on max_standby_delay covered exactly that ground and went on for weeks in 9.0. IIRC I presented the same case you just did and we agreed in the end that was not acceptable. I'm not going to repeat it. Please check the archives. So, again +1 for the feature, but -1 for the currently proposed implementation, based upon review. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 6/30/11 2:00 AM, Simon Riggs wrote: >>> Manual (or scripted) intervention is always necessary if you reach disk >>> 100% full. >> > >> > Wow, that's a pretty crappy failure mode... but I don't think we need >> > to fix it just on account of this patch. It would be nice to fix, of >> > course. > How is that different to running out of space in the main database? > > If I try to pour a pint of milk into a small cup, I don't blame the cup. I have to agree with Simon here. ;-) We can do some things to make this easier for administrators, but there's no way to "solve" the problem. And the things we could do would have to be advanced optional modes which aren't on by default, so they wouldn't really help the DBA with poor planning skills. Here are my suggestions: 1) Have a utility (pg_archivecleanup?) which checks if we have more than a specific setting's worth of archive logs, and breaks replication and deletes the archive logs if we hit that number. This would also require some way for the standby to stop replicating *without* becoming a standalone server, which I don't think we currently have. 2) Have a setting where, regardless of standby_delay settings, the standby will interrupt any running queries and start applying logs as fast as possible if it hits a certain number of unapplied archive logs. Of course, given the issues we had with standby_delay, I'm not sure I want to complicate it further. I think we've already fixed the biggest issue in 9.1, since we now have a limit on the number of WALs the master will keep if archiving is failing ... yes? That's the only big *avoidable* failure mode we have, where a failing standby effectively shuts down the master. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
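As a rough illustration of suggestion (1), the sketch below scans an archive directory, keeps the newest N segments, and unlinks the rest. It is hypothetical code, not pg_archivecleanup: the directory, the cap, and the 24-character file-name filter are assumptions, and it deliberately ignores the harder half of the suggestion (getting the standby to stop replicating without becoming a standalone server).

/*
 * Hypothetical sketch of an archive cap: delete the oldest archived WAL
 * segments once more than MAX_SEGMENTS of them have accumulated.  This is
 * not pg_archivecleanup; paths and limits are illustrative only.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_SEGMENTS 256            /* illustrative cap */

static int
cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

int
main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/var/lib/pgsql/wal_archive";
    DIR        *d = opendir(dir);
    struct dirent *de;
    char      **names = NULL;
    size_t      n = 0;

    if (d == NULL)
    {
        perror("opendir");
        return 1;
    }

    /* Collect plausible segment names (plain WAL file names are 24 chars). */
    while ((de = readdir(d)) != NULL)
    {
        if (strlen(de->d_name) != 24)
            continue;
        names = realloc(names, (n + 1) * sizeof(char *));
        names[n++] = strdup(de->d_name);
    }
    closedir(d);

    /* On a single timeline, lexicographic order matches WAL order. */
    qsort(names, n, sizeof(char *), cmp_names);

    /* Remove everything older than the newest MAX_SEGMENTS files. */
    for (size_t i = 0; n > MAX_SEGMENTS && i < n - MAX_SEGMENTS; i++)
    {
        char        path[4096];

        snprintf(path, sizeof(path), "%s/%s", dir, names[i]);
        if (unlink(path) == 0)
            printf("removed %s\n", path);
        else
            perror(path);
    }
    return 0;
}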
On Thu, Jun 30, 2011 at 1:00 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 6/30/11 2:00 AM, Simon Riggs wrote: >>>> Manual (or scripted) intervention is always necessary if you reach disk >>>> >> 100% full. >>> > >>> > Wow, that's a pretty crappy failure mode... but I don't think we need >>> > to fix it just on account of this patch. It would be nice to fix, of >>> > course. >> How is that different to running out of space in the main database? >> >> If I try to pour a pint of milk into a small cup, I don't blame the cup. > > I have to agree with Simon here. ;-) > > We can do some things to make this easier for administrators, but > there's no way to "solve" the problem. And the things we could do would > have to be advanced optional modes which aren't on by default, so they > wouldn't really help the DBA with poor planning skills. Here's my > suggestions: > > 1) Have a utility (pg_archivecleanup?) which checks if we have more than > a specific settings's worth of archive_logs, and breaks replication and > deletes the archive logs if we hit that number. This would also require > some way for the standby to stop replicating *without* becoming a > standalone server, which I don't think we currently have. > > 2) Have a setting where, regardless of standby_delay settings, the > standby will interrupt any running queries and start applying logs as > fast as possible if it hits a certain number of unapplied archive logs. > Of course, given the issues we had with standby_delay, I'm not sure I > want to complicate it further. > > I think we've already fixed the biggest issue in 9.1, since we now have > a limit on the number of WALs the master will keep if archiving is > failing ... yes? That's the only big *avoidable* failure mode we have, > where a failing standby effectively shuts down the master. I'm not sure we changed anything in this area for 9.1. Am I wrong? wal_keep_segments was present in 9.0. Using that instead of archiving is a reasonable way to bound the amount of disk space that can get used, at the cost of possibly needing to rebuild the standby if things get too far behind. Of course, in any version, you could also use an archive_command that will remove old files to make space if the disk is full, with the same downside: if the standby isn't done with those files, you're now in for a rebuild. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 30, 2011 at 6:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > The only way to control this is with a time delay that can be changed > while the server is running. A recovery.conf parameter doesn't allow > that, so another way is preferable. True. We've talked about making the recovery.conf parameters into GUCs, which would address that concern (and some others). > I think the time problems are more complex than said. The patch relies > upon transaction completion times, but not all WAL records have a time > attached to them. Plus you only used commits anyway, not sure why. For the same reason we do that with the recovery_target_* code - replaying something like a heap insert or heap update doesn't change the user-visible state of the database, because the records aren't visible anyway until the commit record is replayed. > Some actions aren't even transactional, such as DROP DATABASE, amongst Good point. We'd probably need to add a timestamp to the drop database record, as that's a case that people would likely want to defend against with this feature. > others. Consecutive records can be hours apart, so it would be > possible to delay on some WAL records but then replay records that > happened minutes ago, then wait hours for the next apply. So this > patch doesn't do what it claims in all cases. > > Similar discussion on max_standby_delay covered exactly that ground > and went on for weeks in 9.0. IIRC I presented the same case you just > did and we agreed in the end that was not acceptable. I'm not going to > repeat it. Please check the archives. I think this case is a bit different. First, max_standby_delay is relevant for any installation using Hot Standby, whereas this is a feature that specifically involves time. Saying that you have to have time synchronization for Hot Standby to work as designed is more of a burden than saying you need time synchronization *if you want to use the time-delayed recovery feature*. Second, and maybe more importantly, no one has come up with an idea for how to make this work reliably in the presence of time skew. Perhaps we could provide a simple time-skew correction feature that would work in the streaming case (though probably not nearly as well as running ntpd), but as I understand your argument, you're saying that most people will want to use this with archiving. I don't see how to make that work without time synchronization. In the max_standby_delay case, the requirement is that queries not get cancelled too aggressively while at the same time not letting the standby get too far behind the master, which leaves some flexibility in terms of how we actually make that trade-off, and we eventually found a way that didn't require time synchronization, which was an improvement. But for a time-delayed standby, the requirement at least AIUI is that the state of the standby lag the master by a certain time interval, and I don't see any way to do that without comparing slave timestamps with master timestamps. If we can find a similar clever trick here, great! But I'm not seeing how to do it. Now, another option here is to give up on the idea of a time-delayed standby altogether and instead allow the standby to lag the master by a certain number of WAL segments or XIDs. Of course, if we do that, then we will not have a feature called "time-delayed standbys". Instead, we will have a feature called "standbys delayed by a certain number of WAL segments (or XIDs)".
That certainly caters to some of the same use cases, but I think it's severely lacking in the usability department. I bet the first thing most people will do is to try to figure out how to translate between those metrics and time, and I bet we'll get complaints on systems where the activity load is variable and therefore the time lag for a fixed WAL-segment lag or XID-lag is unpredictable. So I think keeping it defined in terms of time is the right way forward, even though the need for external time synchronization is, certainly, not ideal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/30/11 10:25 AM, Robert Haas wrote: > So I think keeping it defined in terms of time is the > right way forward, even though the need for external time > synchronization is, certainly, not ideal. Actually, when we last had the argument about time synchronization, Kevin Grittner (I believe) pointed out that unsynchronized replication servers have an assortment of other issues ... like any read query involving now(). As the person who originally brought up this hurdle, I felt that his argument defeated mine. Certainly I can't see any logical way to have time delay in the absence of clock synchronization of some kind. Also, I kinda feel like this discussion is aimed at overcomplicating a feature which only a small fraction of our users will ever use. Let's keep it as simple as possible. As for delay on streaming replication, I'm for it. I think that post-9.1, thanks to pg_basebackup, the number of our users who are doing archive log shipping is going to drop tremendously. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote: > when we last had the argument about time synchronization, > Kevin Grittner (I believe) pointed out that unsynchronized > replication servers have an assortment of other issues ... like > any read query involving now(). I don't remember making that point, although I think it's a valid one. What I'm sure I pointed out is that we have one central router which synchronizes to a whole bunch of atomic clocks around the world using the normal "discard the outliers and average the rest" algorithm, and then *every single server and workstation on our network synchronizes to that router*. Our database servers are all running on Linux using ntpd. Our monitoring spams us with email if any of the clocks falls outside nominal bounds. (It's been many years since we had a misconfigured server which triggered that.) I think doing anything in PostgreSQL around this beyond allowing DBAs to trust their server clocks is insane. The arguments for using and trusting ntpd are pretty much identical to the arguments for using and trusting the OS file systems. -Kevin
Kevin, > I think doing anything in PostgreSQL around this beyond allowing > DBAs to trust their server clocks is insane. The arguments for > using and trusting ntpd is pretty much identical to the arguments > for using and trusting the OS file systems. Oh, you don't want to implement our own NTP? Coward! ;-) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > I think doing anything in PostgreSQL around this beyond allowing > DBAs to trust their server clocks is insane. The arguments for > using and trusting ntpd is pretty much identical to the arguments > for using and trusting the OS file systems. Except that implementing our own file system would likely have more benefit and be less work than implementing our own time synchronization, at least if we want it to be reliable. Again, I am not trying to pretend that this is any great shakes. MySQL's version of this feature apparently does somehow compensate for time skew, which I assume must mean that their replication works differently than ours - inter alia, it probably requires a TCP socket connection between the servers. Since we don't require that, it limits our options in this area, but also gives us more options in other areas. Still, if I could think of a way to do this that didn't depend on time synchronization, then I'd be in favor of eliminating that requirement. I just can't; and I'm inclined to think it isn't possible. I wouldn't be opposed to having an option to try to detect time skew between the master and the slave and, say, display that information in pg_stat_replication. It might be useful to have that data for monitoring purposes, and it probably wouldn't even be that much code. However, I'd be a bit hesitant to use that data to "correct" the amount of time we spend waiting for time-delayed replication, because it would doubtless be extremely imprecise compared to real time synchronization, and considerably more error-prone. IOW, what you said. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
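For what it's worth, the detection-only idea could reuse the one-round-trip estimate that NTP is built on, piggybacked on existing walsender/walreceiver keepalive traffic. The sketch below is a hypothetical, self-contained illustration of the arithmetic only; the exchange struct and the sample timestamps are invented, not walreceiver code, and the estimate is only as good as the (assumed symmetric) network latency.

/*
 * Hypothetical sketch of clock-skew estimation from one request/response
 * exchange, as in NTP's basic offset formula.  Values are illustrative.
 */
#include <stdio.h>

/* Microsecond timestamps taken on each side of a single round trip. */
typedef struct
{
    long long   t1;             /* standby: request sent   (standby clock) */
    long long   t2;             /* master:  request seen   (master clock)  */
    long long   t3;             /* master:  reply sent     (master clock)  */
    long long   t4;             /* standby: reply received (standby clock) */
} exchange;

/* Estimated master-minus-standby clock offset, in microseconds. */
static long long
estimate_offset(const exchange *x)
{
    return ((x->t2 - x->t1) + (x->t3 - x->t4)) / 2;
}

int
main(void)
{
    /* Example: the master's clock runs roughly two seconds ahead. */
    exchange    x = {1000000LL, 3010000LL, 3020000LL, 1040000LL};

    printf("estimated offset: %lld us\n", estimate_offset(&x));
    return 0;
}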
On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Some actions aren't even transactional, such as DROP DATABASE, amongst > > Good point. We'd probably need to add a timestamp to the drop > database record, as that's a case that people would likely want to > defend against with this feature. This means that the recovery_target_* code would also need to deal with the DROP DATABASE case. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jul 1, 2011 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> I think doing anything in PostgreSQL around this beyond allowing >> DBAs to trust their server clocks is insane. The arguments for >> using and trusting ntpd is pretty much identical to the arguments >> for using and trusting the OS file systems. > > Except that implementing our own file system would likely have more > benefit and be less work than implementing our own time > synchronization, at least if we want it to be reliable. > > Again, I am not trying to pretend that this is any great shakes. > MySQL's version of this feature apparently does somehow compensate for > time skew, which I assume must mean that their replication works > differently than ours - inter alia, it probably requires a TCP socket > connection between the servers. Since we don't require that, it > limits our options in this area, but also gives us more options in > other areas. Still, if I could think of a way to do this that didn't > depend on time synchronization, then I'd be in favor of eliminating > that requirement. I just can't; and I'm inclined to think it isn't > possible. > > I wouldn't be opposed to having an option to try to detect time skew > between the master and the slave and, say, display that information in > pg_stat_replication. It might be useful to have that data for > monitoring purposes, and it probably wouldn't even be that much code. > However, I'd be a bit hesitant to use that data to "correct" the > amount of time we spend waiting for time-delayed replication, because > it would doubtless be extremely imprecise compared to real time > synchronization, and considerably more error-prone. IOW, what you > said. I agree with Robert. It's difficult to implement a time-synchronization feature which can deal with all the cases, and I'm not sure if that's really worth taking our time. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes: > On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Some actions aren't even transactional, such as DROP DATABASE, amongst >> >> Good point. We'd probably need to add a timestamp to the drop >> database record, as that's a case that people would likely want to >> defend against with this feature. > > This means that the recovery_target_* code would also need to deal with > the DROP DATABASE case. > There is no problem if you use "restore point" names... but of course you lose flexibility (i.e., you can't restore to 5 minutes before now). Mmm... a lazy idea: can't we just create a restore point WAL record *before* we actually drop the database? Then we won't need to modify the recovery_target_* logic (if DROP DATABASE is the only such case, maybe that's enough without complicating the code), and we can provide that protection as of 9.1. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL 24x7 Support, Development, Training & Services
On Thu, Jun 30, 2011 at 6:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think the time problems are more complex than said. The patch relies >> upon transaction completion times, but not all WAL records have a time >> attached to them. Plus you only used commits anyway, not sure why. > > For the same reason we do that with the recovery_target_* code - > replaying something like a heap insert or heap update doesn't change > the user-visible state of the database, because the records aren't > visible anyway until the commit record is replayed. > >> Some actions aren't even transactional, such as DROP DATABASE, amongst > > Good point. We'd probably need to add a timestamp to the drop > database record, as that's a case that people would likely want to > defend against with this feature. > >> others. Consecutive records can be hours apart, so it would be >> possible to delay on some WAL records but then replay records that >> happened minutes ago, then wait hours for the next apply. So this >> patch doesn't do what it claims in all cases. You misread my words above, neglecting the "amongst others" part. I don't believe you'll be able to do this just by relying on timestamps on WAL records because not all records carry timestamps and we're not going to add them just for this. It's easier to make this work usefully using pg_standby. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
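The pg_standby-style approach mentioned above amounts to delaying in the restore_command rather than in the startup process. Below is a heavily simplified, hypothetical sketch of such a wrapper (it is not pg_standby): it fails if the requested file has not been archived yet, and otherwise refuses to hand the segment over until the file's mtime is at least DELAY_SECS old. A usable tool would also need trigger-file handling, timeline history files, and configurable timeouts. It could be invoked as, for example, restore_command = 'delay_restore /archive/%f %p', where delay_restore is the illustrative name used here.

/*
 * Hypothetical restore_command wrapper (not pg_standby): copy an archived
 * WAL segment into place only after it has aged DELAY_SECS since it was
 * archived, judged by the file's mtime.
 *
 * Usage: delay_restore <archived file> <target path>
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define DELAY_SECS (6 * 60 * 60)    /* a six-hour apply delay, as an example */

static int
copy_file(const char *src, const char *dst)
{
    FILE       *in = fopen(src, "rb");
    FILE       *out = in ? fopen(dst, "wb") : NULL;
    char        buf[65536];
    size_t      n;

    if (in == NULL || out == NULL)
    {
        if (in)
            fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    return fclose(out);
}

int
main(int argc, char **argv)
{
    struct stat st;

    if (argc != 3)
        return 1;

    /* Not archived yet: fail, and let recovery retry (or try streaming). */
    if (stat(argv[1], &st) != 0)
        return 1;

    /* Sleep until the segment has aged DELAY_SECS since it was archived. */
    while (time(NULL) < st.st_mtime + DELAY_SECS)
        sleep(10);

    return copy_file(argv[1], argv[2]) == 0 ? 0 : 1;
}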